{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lesson-01 Assignment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 今天是2020年08月16日，今天世界上又多了一名AI工程师 :) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 本次作业的内容"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. 复现课堂代码\n",
    "\n",
    "在本部分，你需要参照我们给大家的GitHub地址里边的课堂代码，结合课堂内容，复现内容。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1.1 基于规则的语言模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "（1）初始语言规则"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T07:33:41.106088Z",
     "start_time": "2020-11-29T07:33:41.087096Z"
    }
   },
   "outputs": [],
   "source": [
    "simple_grammar = \"\"\"\n",
    "sentence => noun_phrase verb_phrase\n",
    "noun_phrase => Article Adj* noun\n",
    "Adj* => null | Adj Adj*                  \n",
    "verb_phrase => verb noun_phrase\n",
    "Article =>  一个 | 这个\n",
    "noun =>   女人 |  篮球 | 桌子 | 小猫\n",
    "verb => 看着   |  坐在 |  听着 | 看见\n",
    "Adj =>  蓝色的 | 好看的 | 小小的\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T07:34:23.816804Z",
     "start_time": "2020-11-29T07:34:23.797817Z"
    }
   },
   "outputs": [],
   "source": [
    "import random"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T07:34:42.826638Z",
     "start_time": "2020-11-29T07:34:42.805650Z"
    }
   },
   "outputs": [],
   "source": [
    "def adj():  return random.choice('蓝色的 | 好看的 | 小小的'.split('|')).split()[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T07:37:51.727498Z",
     "start_time": "2020-11-29T07:37:51.715507Z"
    }
   },
   "outputs": [],
   "source": [
    "def adj_star():\n",
    "    return random.choice([lambda : '', lambda : adj() + adj_star()])()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T07:37:57.553436Z",
     "start_time": "2020-11-29T07:37:57.530450Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'好看的'"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adj_star()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "问题：语法的局限性，即如果我们更换了语法，会发现所有相关的程序，都要重新写。\n",
    "\n",
    "措施：根据语法描述，生成通用的语法规则"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T07:44:27.375636Z",
     "start_time": "2020-11-29T07:44:27.370640Z"
    }
   },
   "outputs": [],
   "source": [
    "# （1.1） adj语法描述\n",
    "adj_grammar = \"\"\"\n",
    "Adj* => null | Adj Adj*\n",
    "Adj =>  蓝色的 | 好看的 | 小小的\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T07:45:05.521229Z",
     "start_time": "2020-11-29T07:45:05.506243Z"
    }
   },
   "outputs": [],
   "source": [
    "# （1.2）根据语法描述 grammar_str 生成规则 grammar\n",
    "def create_grammar(grammar_str, split='=>', line_split='\\n'):\n",
    "    grammar = {}\n",
    "    for line in grammar_str.split(line_split):\n",
    "        if not line.strip(): continue\n",
    "        exp, stmt = line.split(split)\n",
    "        grammar[exp.strip()] = [s.split() for s in stmt.split('|')]\n",
    "    return grammar"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T07:45:33.317962Z",
     "start_time": "2020-11-29T07:45:33.308965Z"
    }
   },
   "outputs": [],
   "source": [
    "# （1.3）根据语法描述 adj_grammar 生成语法规则 grammar\n",
    "grammar = create_grammar(adj_grammar)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T07:46:10.953210Z",
     "start_time": "2020-11-29T07:46:10.932220Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Adj*': [['null'], ['Adj', 'Adj*']], 'Adj': [['蓝色的'], ['好看的'], ['小小的']]}"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# adj_grammar的语法规则\n",
    "grammar"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T07:46:48.099115Z",
     "start_time": "2020-11-29T07:46:48.083126Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['null'], ['Adj', 'Adj*']]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grammar['Adj*']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T07:46:48.593385Z",
     "start_time": "2020-11-29T07:46:48.572399Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[['蓝色的'], ['好看的'], ['小小的']]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "grammar['Adj']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "根据语法规则生成句子"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T07:49:21.876313Z",
     "start_time": "2020-11-29T07:49:21.858328Z"
    }
   },
   "outputs": [],
   "source": [
    "# （2.1）句子语法描述\n",
    "simple_grammar = \"\"\"\n",
    "sentence => noun_phrase verb_phrase\n",
    "noun_phrase => Article Adj* noun\n",
    "Adj* => null | Adj Adj*                  \n",
    "verb_phrase => verb noun_phrase\n",
    "Article =>  一个 | 这个\n",
    "noun =>   女人 |  篮球 | 桌子 | 小猫\n",
    "verb => 看着   |  坐在 |  听着 | 看见\n",
    "Adj =>  蓝色的 | 好看的 | 小小的\n",
    "\"\"\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T07:53:09.697535Z",
     "start_time": "2020-11-29T07:53:09.687545Z"
    }
   },
   "outputs": [],
   "source": [
    "# （2.2）根据语句子法描述生成句子语法规则\n",
    "example_grammar = create_grammar(simple_grammar)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T07:53:18.250119Z",
     "start_time": "2020-11-29T07:53:18.228135Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'sentence': [['noun_phrase', 'verb_phrase']],\n",
       " 'noun_phrase': [['Article', 'Adj*', 'noun']],\n",
       " 'Adj*': [['null'], ['Adj', 'Adj*']],\n",
       " 'verb_phrase': [['verb', 'noun_phrase']],\n",
       " 'Article': [['一个'], ['这个']],\n",
       " 'noun': [['女人'], ['篮球'], ['桌子'], ['小猫']],\n",
       " 'verb': [['看着'], ['坐在'], ['听着'], ['看见']],\n",
       " 'Adj': [['蓝色的'], ['好看的'], ['小小的']]}"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# （2.3）生成的句子语法规则\n",
    "example_grammar"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T07:53:47.722353Z",
     "start_time": "2020-11-29T07:53:47.707368Z"
    }
   },
   "outputs": [],
   "source": [
    "# （3.1）根据（句子）语法规则生成句子\n",
    "choice = random.choice\n",
    "\n",
    "def generate(gram, target):\n",
    "    if target not in gram: return target # means target is a terminal expression #1\n",
    "    \n",
    "    expaned = [generate(gram, t) for t in choice(gram[target])]  #2\n",
    "    return ''.join([e if e != '/n' else '\\n' for e in expaned if e != 'null']) #3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T07:58:14.618485Z",
     "start_time": "2020-11-29T07:58:14.608488Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'一个好看的小小的篮球坐在一个篮球'"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# （3.2）根据（句子）语法规则生成句子\n",
    "generate(gram=example_grammar, target='sentence')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T07:59:17.435398Z",
     "start_time": "2020-11-29T07:59:17.402415Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "您好我是17187号,您需要喝酒吗？\n",
      "我找找玩的\n",
      "先生,您好我是1号,您需要喝酒吗？\n",
      "俺找找玩的\n",
      "你好我是12号,您需要打猎吗？\n",
      "俺想找点玩的\n",
      "您好我是3号,请问你要赌博吗？\n",
      "我找找乐子\n",
      "小朋友,您好我是8号,请问你要赌博吗？\n",
      "我们想找点玩的\n",
      "你好我是58号,您需要打牌吗？\n",
      "我想找点乐子\n",
      "女士,您好我是3号,您需要赌博吗？\n",
      "我想找点乐子\n",
      "女士,您好我是25号,请问你要打牌吗？\n",
      "我们找找乐子\n",
      "先生,您好我是58848号,您需要打牌吗？\n",
      "我找找玩的\n",
      "小朋友,您好我是1781号,请问你要打牌吗？\n",
      "俺找找乐子\n",
      "女士,您好我是8763号,请问你要喝酒吗？\n",
      "我们想找点乐子\n",
      "您好我是8号,请问你要赌博吗？\n",
      "我们想找点乐子\n",
      "先生,你好我是94号,请问你要赌博吗？\n",
      "我们想找点玩的\n",
      "女士,您好我是68号,您需要打牌吗？\n",
      "我们想找点玩的\n",
      "你好我是33号,您需要打猎吗？\n",
      "俺找找玩的\n",
      "小朋友,你好我是1号,请问你要喝酒吗？\n",
      "我们想找点乐子\n",
      "先生,你好我是5号,您需要喝酒吗？\n",
      "俺找找乐子\n",
      "你好我是14号,您需要打牌吗？\n",
      "我找找玩的\n",
      "小朋友,你好我是4号,您需要赌博吗？\n",
      "我想找点玩的\n",
      "先生,你好我是4号,请问你要赌博吗？\n",
      "俺找找乐子\n"
     ]
    }
   ],
   "source": [
    "# 例子1\n",
    "\n",
    "#在西部世界里，一个”人类“的语言可以定义为：\n",
    "\n",
    "human = \"\"\"\n",
    "human = 自己 寻找 活动\n",
    "自己 = 我 | 俺 | 我们 \n",
    "寻找 = 找找 | 想找点 \n",
    "活动 = 乐子 | 玩的\n",
    "\"\"\"\n",
    "\n",
    "\n",
    "#一个“接待员”的语言可以定义为\n",
    "\n",
    "host = \"\"\"\n",
    "host = 寒暄 报数 询问 业务相关 结尾 \n",
    "报数 = 我是 数字 号 ,\n",
    "数字 = 单个数字 | 数字 单个数字 \n",
    "单个数字 = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 \n",
    "寒暄 = 称谓 打招呼 | 打招呼\n",
    "称谓 = 人称 ,\n",
    "人称 = 先生 | 女士 | 小朋友\n",
    "打招呼 = 你好 | 您好 \n",
    "询问 = 请问你要 | 您需要\n",
    "业务相关 = 玩玩 具体业务\n",
    "玩玩 = null\n",
    "具体业务 = 喝酒 | 打牌 | 打猎 | 赌博\n",
    "结尾 = 吗？\n",
    "\"\"\"\n",
    "for i in range(20):\n",
    "    print(generate(gram=create_grammar(host, split='='), target='host'))\n",
    "    print(generate(gram=create_grammar(human, split='='), target='human'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "输入的语法描述改变，程序不变"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:02:15.114305Z",
     "start_time": "2020-11-29T08:02:15.105316Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "while(lib_libdatabase_info_8_5){/Ndatabase_info_lib_info_student_2=name/N}\n"
     ]
    }
   ],
   "source": [
    "# 例子3：\n",
    "simpel_programming = '''\n",
    "programming => if_stmt | assign | while_loop\n",
    "while_loop => while ( cond ) { change_line stmt change_line }\n",
    "if_stmt => if ( cond )  { change_line stmt change_line } | if ( cond )  { change_line stmt change_line } else { change_line stmt change_line } \n",
    "change_line => /N\n",
    "cond => var op var\n",
    "op => | == | < | >= | <= \n",
    "stmt => assign | if_stmt\n",
    "assign => var = var\n",
    "var =>  var _ num | words \n",
    "words => words _ word | word \n",
    "word => name | info |  student | lib | database \n",
    "nums => nums num | num\n",
    "num => 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 0\n",
    "'''\n",
    "# 根据语法规则生成一段代码\n",
    "print(generate(gram=create_grammar(simpel_programming, split='=>'), target='programming'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:03:13.069531Z",
     "start_time": "2020-11-29T08:03:13.049540Z"
    }
   },
   "outputs": [],
   "source": [
    "# 例子4：格式化输出\n",
    "def pretty_print(line):\n",
    "    # utility tool function\n",
    "    lines = line.split('/N')\n",
    "    \n",
    "    code_lines = []\n",
    "    \n",
    "    for i, sen in enumerate(lines):\n",
    "        if i < len(lines) / 2: \n",
    "            #print()\n",
    "            code_lines.append(i * \"  \" + sen)\n",
    "        else:\n",
    "            code_lines.append((len(lines) - i) * \" \" + sen)\n",
    "    \n",
    "    return code_lines"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:04:16.675419Z",
     "start_time": "2020-11-29T08:04:16.665426Z"
    }
   },
   "outputs": [],
   "source": [
    "generated_programming = []\n",
    "# 根据语法描述生成20段代码\n",
    "for i in range(2):\n",
    "    generated_programming += pretty_print(generate(gram=create_grammar(simpel_programming, split='=>'), target='programming'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:04:17.551271Z",
     "start_time": "2020-11-29T08:04:17.534282Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "while(info==info_5_0_2_2_5_5_3_6_1_9_7_2){\n",
      "  database_database=name_info\n",
      " }\n",
      "if(info_1==info_name_database_info_student){\n",
      "  database_database=database_0\n",
      " }\n"
     ]
    }
   ],
   "source": [
    "# 打印20段代码\n",
    "for line in generated_programming:\n",
    "    print(line)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1.2  Language Model "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$ language\\_model(String) = Probability(String) \\in (0, 1) $$\n",
    "\n",
    "$$ Pro(w_1 w_2 w_3 w_4) = Pr(w_1 | w_2 w_3 w_ 4) * P(w_2 | w_3 w_4) * Pr(w_3 | w_4) * Pr(w_4)$$ \n",
    "\n",
    "$$ Pro(w_1 w_2 w_3 w_4) \\sim Pr(w_1 | w_2 ) * P(w2 | w_3 ) * Pr(w_3 | w_4) * Pr(w_4)$$ "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "how to get $ Pr(w1 | w2 w3 w4) $ ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:17:00.910502Z",
     "start_time": "2020-11-29T08:16:52.761362Z"
    }
   },
   "outputs": [],
   "source": [
    "import random\n",
    "import jieba\n",
    "import pandas as pd\n",
    "import re\n",
    "from collections import Counter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:17:31.327733Z",
     "start_time": "2020-11-29T08:17:30.068378Z"
    }
   },
   "outputs": [],
   "source": [
    "from functools import reduce\n",
    "from operator import add, mul\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数据预处理"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:18:30.088410Z",
     "start_time": "2020-11-29T08:18:23.710349Z"
    }
   },
   "outputs": [],
   "source": [
    "# 读取文件\n",
    "filename = 'sqlResult_1558435.csv'\n",
    "content = pd.read_csv(filename, encoding='gb18030')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:18:50.628273Z",
     "start_time": "2020-11-29T08:18:50.584299Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>author</th>\n",
       "      <th>source</th>\n",
       "      <th>content</th>\n",
       "      <th>feature</th>\n",
       "      <th>title</th>\n",
       "      <th>url</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>89617</td>\n",
       "      <td>NaN</td>\n",
       "      <td>快科技@http://www.kkj.cn/</td>\n",
       "      <td>此外，自本周（6月12日）起，除小米手机6等15款机型外，其余机型已暂停更新发布（含开发版/...</td>\n",
       "      <td>{\"type\":\"科技\",\"site\":\"cnbeta\",\"commentNum\":\"37\"...</td>\n",
       "      <td>小米MIUI 9首批机型曝光：共计15款</td>\n",
       "      <td>http://www.cnbeta.com/articles/tech/623597.htm</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>89616</td>\n",
       "      <td>NaN</td>\n",
       "      <td>快科技@http://www.kkj.cn/</td>\n",
       "      <td>骁龙835作为唯一通过Windows 10桌面平台认证的ARM处理器，高通强调，不会因为只考...</td>\n",
       "      <td>{\"type\":\"科技\",\"site\":\"cnbeta\",\"commentNum\":\"15\"...</td>\n",
       "      <td>骁龙835在Windows 10上的性能表现有望改善</td>\n",
       "      <td>http://www.cnbeta.com/articles/tech/623599.htm</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>89615</td>\n",
       "      <td>NaN</td>\n",
       "      <td>快科技@http://www.kkj.cn/</td>\n",
       "      <td>此前的一加3T搭载的是3400mAh电池，DashCharge快充规格为5V/4A。\\r\\n...</td>\n",
       "      <td>{\"type\":\"科技\",\"site\":\"cnbeta\",\"commentNum\":\"18\"...</td>\n",
       "      <td>一加手机5细节曝光：3300mAh、充半小时用1天</td>\n",
       "      <td>http://www.cnbeta.com/articles/tech/623601.htm</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>89614</td>\n",
       "      <td>NaN</td>\n",
       "      <td>新华社</td>\n",
       "      <td>这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\\r\\n</td>\n",
       "      <td>{\"type\":\"国际新闻\",\"site\":\"环球\",\"commentNum\":\"0\",\"j...</td>\n",
       "      <td>葡森林火灾造成至少62人死亡 政府宣布进入紧急状态（组图）</td>\n",
       "      <td>http://world.huanqiu.com/hot/2017-06/10866126....</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>89613</td>\n",
       "      <td>胡淑丽_MN7479</td>\n",
       "      <td>深圳大件事</td>\n",
       "      <td>（原标题：44岁女子跑深圳约会网友被拒，暴雨中裸身奔走……）\\r\\n@深圳交警微博称：昨日清...</td>\n",
       "      <td>{\"type\":\"新闻\",\"site\":\"网易热门\",\"commentNum\":\"978\",...</td>\n",
       "      <td>44岁女子约网友被拒暴雨中裸奔 交警为其披衣相随</td>\n",
       "      <td>http://news.163.com/17/0618/00/CN617P3Q0001875...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      id      author                  source  \\\n",
       "0  89617         NaN  快科技@http://www.kkj.cn/   \n",
       "1  89616         NaN  快科技@http://www.kkj.cn/   \n",
       "2  89615         NaN  快科技@http://www.kkj.cn/   \n",
       "3  89614         NaN                     新华社   \n",
       "4  89613  胡淑丽_MN7479                   深圳大件事   \n",
       "\n",
       "                                             content  \\\n",
       "0  此外，自本周（6月12日）起，除小米手机6等15款机型外，其余机型已暂停更新发布（含开发版/...   \n",
       "1  骁龙835作为唯一通过Windows 10桌面平台认证的ARM处理器，高通强调，不会因为只考...   \n",
       "2  此前的一加3T搭载的是3400mAh电池，DashCharge快充规格为5V/4A。\\r\\n...   \n",
       "3    这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\\r\\n   \n",
       "4  （原标题：44岁女子跑深圳约会网友被拒，暴雨中裸身奔走……）\\r\\n@深圳交警微博称：昨日清...   \n",
       "\n",
       "                                             feature  \\\n",
       "0  {\"type\":\"科技\",\"site\":\"cnbeta\",\"commentNum\":\"37\"...   \n",
       "1  {\"type\":\"科技\",\"site\":\"cnbeta\",\"commentNum\":\"15\"...   \n",
       "2  {\"type\":\"科技\",\"site\":\"cnbeta\",\"commentNum\":\"18\"...   \n",
       "3  {\"type\":\"国际新闻\",\"site\":\"环球\",\"commentNum\":\"0\",\"j...   \n",
       "4  {\"type\":\"新闻\",\"site\":\"网易热门\",\"commentNum\":\"978\",...   \n",
       "\n",
       "                           title  \\\n",
       "0           小米MIUI 9首批机型曝光：共计15款   \n",
       "1     骁龙835在Windows 10上的性能表现有望改善   \n",
       "2      一加手机5细节曝光：3300mAh、充半小时用1天   \n",
       "3  葡森林火灾造成至少62人死亡 政府宣布进入紧急状态（组图）   \n",
       "4       44岁女子约网友被拒暴雨中裸奔 交警为其披衣相随   \n",
       "\n",
       "                                                 url  \n",
       "0     http://www.cnbeta.com/articles/tech/623597.htm  \n",
       "1     http://www.cnbeta.com/articles/tech/623599.htm  \n",
       "2     http://www.cnbeta.com/articles/tech/623601.htm  \n",
       "3  http://world.huanqiu.com/hot/2017-06/10866126....  \n",
       "4  http://news.163.com/17/0618/00/CN617P3Q0001875...  "
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 前5行数据\n",
    "content.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:19:22.430235Z",
     "start_time": "2020-11-29T08:19:22.411249Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "89611\n"
     ]
    }
   ],
   "source": [
    "# 提取 content 列\n",
    "articles = content['content'].tolist()\n",
    "print(len(articles))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:19:35.407337Z",
     "start_time": "2020-11-29T08:19:35.393346Z"
    }
   },
   "outputs": [],
   "source": [
    "# 正则查找所有字词\n",
    "def token(string):\n",
    "    # we will learn the regular expression next course.\n",
    "    return re.findall('\\w+', string)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:21:06.932599Z",
     "start_time": "2020-11-29T08:21:04.545070Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Building prefix dict from the default dictionary ...\n",
      "Dumping model to file cache C:\\Users\\Lee\\AppData\\Local\\Temp\\jieba.cache\n",
      "Loading model cost 2.347 seconds.\n",
      "Prefix dict has been built succesfully.\n"
     ]
    }
   ],
   "source": [
    "# 将第110条语句进行分词并计数\n",
    "with_jieba_cut = Counter(jieba.cut(articles[110]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:21:31.830229Z",
     "start_time": "2020-11-29T08:21:31.810242Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('，', 88),\n",
       " ('的', 73),\n",
       " ('。', 39),\n",
       " ('\\r\\n', 27),\n",
       " ('了', 20),\n",
       " ('们', 18),\n",
       " ('工作队', 16),\n",
       " ('村民', 15),\n",
       " ('收割', 14),\n",
       " ('、', 12)]"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 词频最高的10个词\n",
    "with_jieba_cut.most_common()[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:25:13.545562Z",
     "start_time": "2020-11-29T08:25:13.529575Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'在外国名著麦田里的守望者中作者想要守护麦田里如自己内心一般纯真的孩子们而驻村干部们也在这个炎热的夏天里撸袖子上阵真正做起了村民们的麦田守望者三夏时节不等人你看到了吗不停翻涌起伏仿若铺陈至天边的金黄麦浪中那若隐若现的人影是自治区新闻出版广电局驻和田市肖尔巴格乡合尼村工作队的队员与工作队组织的青年志愿者在这个炎热的夏季他们深入田间地头帮助村民们收割小麦扛起收麦机麦田中的每个人都显得兴致勃勃一天下来就近22亩小麦收割完毕志愿者麦麦提亚森擦去满脸的汗水高兴地告诉驻村队员我们青年志愿者应该多做贡献为村里的脱贫致富出把力工作队带着我们为村里的老人服务看到那些像我爷爷奶奶一样的老人赞许感谢的目光我体会到了帮助他人的快乐自治区新闻出版广电局驻村工作队孙敏艾力依布拉音麦收时节我们在一起6月中旬的和田墨玉麦田金黄静待收割6月14日15日两天自治区高级人民法院驻和田地区墨玉县吐外特乡罕勒克艾日克村工作队与48名村民志愿者一道帮助村里29户有需要的村民进行小麦收割工作田间地头罕勒克艾日克村志愿队的红旗迎风飘扬格外醒目10余台割麦机一起轰鸣男人们在用机器收割小麦的同时几名妇女也加入到志愿队构成了一道美丽的麦收风景休息空闲工作队员和村民们坐在树荫下田埂上互相问好聊天语言交流有困难就用手势动作比划着聊天有趣地交流方式不时引来阵阵欢笑大家在一同享受丰收和喜悦也一同增进着彼此的情感和友谊自治区高级人民法院驻村工作队周春梅艾地艾木阿不拉细看稻菽千重浪6月15日自治区煤田灭火工程局的干部职工们再一次跋涉1000多公里来到了叶城县萨依巴格乡阿亚格欧尔达贝格村见到了自己的亲戚现场处处都透出掩盖不住的喜悦一声声亲切的谢谢一个个结实的拥抱都透露出浓浓的亲情没坐一会儿在嘘寒问暖中大家了解到在麦收的关键时刻部分村民家中却存在收割难的问题小麦成熟期短收获的时间集中天气的变化对小麦最终产量的影响极大如果不能及时收割会有不小损失的于是大家几乎立刻就决定要帮助亲戚们收割麦子在茂密的麦地里干部们每人手持一把镰刀一字排开挽起衣袖卷起裤腿挥舞着镰刀进行着无声的竞赛骄阳似火汗如雨下但这都挡不住大家的热情随着此起彼伏的镰刀割倒麦子的刷刷声响不一会一束束沉甸甸的麦穗就被整齐地堆放了起来当看到自己亲手收割的金黄色麦穗被一簇簇地打成捆运送到晒场每个人的脸上都露出了灿烂的笑容自治区煤田灭火工程局驻村工作队马浩南这是一个收获多多的季节6月13日清晨6时许和田地区民丰县若雅乡特开墩村的麦田里已经传来马达轰鸣声原来是自治区质监局驻村工作队趁着天气尚且凉爽开始了麦田的收割工作忙碌间隙志愿者队伍搬来清凉的水村民们拎来鲜甜的西瓜抹一把汗水吃一牙西瓜甜蜜的汁水似乎流进了每一个人的心里说起割麦子对于生活在这片土地上的村民来说是再平常不过的事但是对于工作队队员们来说却是陌生的自治区质监局驻民丰县若克雅乡博斯坦村工作队队员们一开始觉得十几个人一起收割二亩地应该会挺快的结果却一点不简单镰刀拿到自己手里割起来考验才真正的开始大家弓着腰弯着腿亦步亦趋手上挥舞着镰刀时刻注意不要让镰刀割到自己脚下还要留心不要把套种的玉米苗踩伤不一会儿就已经汗流浃背了抬头看看身边的村民早就远远地割到前面去了只有今年已经56岁的工作队队长李树刚有割麦经验多少给队员们挽回了些面子赶不上村民们割麦子的速度更不要说搞定收割机这台大家伙了现代化的机械收割能成倍提升小麦的收割速度李树刚说不过能有这样的体验拉近和村民的距离也是很难得的体验自治区质监局驻村工作队王辉马君刚我们是麦田的守护者为了应对麦收新疆银监局驻和田县塔瓦库勒乡也先巴扎村工作队一早就从经济支援和人力支援两方面做好了准备一方面工作队帮村里购入了5台小麦收割机另一边还组织村干部青年团员等组成了6支近百人的收割先锋突击队帮助村民们抢收麦子看着及时归仓的麦子村民们喜得合不拢嘴纷纷摘下自家杏树上的杏子送给工作队金黄的麦穗温暖了村民们的心香甜的杏子温暖了工作队员的心麦子加杏子拉近了村民和队员们的心新疆银监局驻村工作队王继发免责声明本文仅代表作者个人观点与环球网无关其原创性以及文中陈述文字和内容未经本站证实对本文以及其中全部或者部分内容文字的真实性完整性及时性本站不作任何保证或承诺请读者仅作参考并请自行核实相关内容'"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 查找第110条记录的所有字词，无空格连接\n",
    "''.join(token(articles[110]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:25:58.469271Z",
     "start_time": "2020-11-29T08:25:54.648614Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "89611\n"
     ]
    }
   ],
   "source": [
    "# 查找每条记录的所有字词，无空格连接\n",
    "articles_clean = [''.join(token(str(a)))for a in articles]\n",
    "print(len(articles_clean))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:26:36.530939Z",
     "start_time": "2020-11-29T08:26:35.527258Z"
    }
   },
   "outputs": [],
   "source": [
    "# 保存到文件\n",
    "with open('article_9k.txt', 'w') as f:\n",
    "    for a in articles_clean:\n",
    "        f.write(a + '\\n')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "分词"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:27:25.589249Z",
     "start_time": "2020-11-29T08:27:25.574258Z"
    }
   },
   "outputs": [],
   "source": [
    "# 定义分词函数\n",
    "def cut(string): return list(jieba.cut(string))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:31:05.447649Z",
     "start_time": "2020-11-29T08:29:43.168201Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\n",
      "100\n",
      "200\n",
      "300\n",
      "400\n",
      "500\n",
      "600\n",
      "700\n",
      "800\n",
      "900\n",
      "1000\n",
      "1100\n",
      "1200\n",
      "1300\n",
      "1400\n",
      "1500\n",
      "1600\n",
      "1700\n",
      "1800\n",
      "1900\n",
      "2000\n",
      "2100\n",
      "2200\n",
      "2300\n",
      "2400\n",
      "2500\n",
      "2600\n",
      "2700\n",
      "2800\n",
      "2900\n",
      "3000\n",
      "3100\n",
      "3200\n",
      "3300\n",
      "3400\n",
      "3500\n",
      "3600\n",
      "3700\n",
      "3800\n",
      "3900\n",
      "4000\n",
      "4100\n",
      "4200\n",
      "4300\n",
      "4400\n",
      "4500\n",
      "4600\n",
      "4700\n",
      "4800\n",
      "4900\n",
      "5000\n",
      "5100\n",
      "5200\n",
      "5300\n",
      "5400\n",
      "5500\n",
      "5600\n",
      "5700\n",
      "5800\n",
      "5900\n",
      "6000\n",
      "6100\n",
      "6200\n",
      "6300\n",
      "6400\n",
      "6500\n",
      "6600\n",
      "6700\n",
      "6800\n",
      "6900\n",
      "7000\n",
      "7100\n",
      "7200\n",
      "7300\n",
      "7400\n",
      "7500\n",
      "7600\n",
      "7700\n",
      "7800\n",
      "7900\n",
      "8000\n",
      "8100\n",
      "8200\n",
      "8300\n",
      "8400\n",
      "8500\n",
      "8600\n",
      "8700\n",
      "8800\n",
      "8900\n",
      "9000\n",
      "9100\n",
      "9200\n",
      "9300\n",
      "9400\n",
      "9500\n",
      "9600\n",
      "9700\n",
      "9800\n",
      "9900\n",
      "10000\n"
     ]
    }
   ],
   "source": [
    "# 将保存到文件中的前10000行字词进行分词\n",
    "TOKEN = []\n",
    "\n",
    "for i, line in enumerate((open('article_9k.txt'))):\n",
    "    if i % 100 == 0: print(i)\n",
    "    \n",
    "    # replace 10000 with a big number when you do your homework. \n",
    "    \n",
    "    if i > 10000: break    \n",
    "    TOKEN += cut(line)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:31:42.923906Z",
     "start_time": "2020-11-29T08:31:41.957653Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('的', 184244),\n",
       " ('在', 47370),\n",
       " ('了', 36722),\n",
       " ('和', 30809),\n",
       " ('是', 30283),\n",
       " ('月', 18711),\n",
       " ('也', 15995),\n",
       " ('年', 15971),\n",
       " ('有', 14714),\n",
       " ('为', 14448),\n",
       " ('等', 14340),\n",
       " ('将', 14060),\n",
       " ('对', 13074),\n",
       " ('与', 12568),\n",
       " ('日', 12322),\n",
       " ('中', 11117),\n",
       " ('中国', 11036),\n",
       " ('6', 10477),\n",
       " ('上', 10192),\n",
       " ('不', 10027),\n",
       " ('\\n', 10001),\n",
       " ('他', 9530),\n",
       " ('都', 9447),\n",
       " ('发展', 8795),\n",
       " ('企业', 8584),\n",
       " ('就', 8537),\n",
       " ('到', 8338),\n",
       " ('市场', 8095),\n",
       " ('但', 7729),\n",
       " ('这', 7658),\n",
       " ('被', 7575),\n",
       " ('从', 7513),\n",
       " ('并', 7412),\n",
       " ('人', 7339),\n",
       " ('后', 7084),\n",
       " ('公司', 6915),\n",
       " ('一个', 6772),\n",
       " ('说', 6703),\n",
       " ('新', 6467),\n",
       " ('表示', 6309),\n",
       " ('要', 6276),\n",
       " ('还', 6245),\n",
       " ('会', 6179),\n",
       " ('个', 6176),\n",
       " ('我', 6141),\n",
       " ('而', 6090),\n",
       " ('进行', 5802),\n",
       " ('我们', 5742),\n",
       " ('记者', 5734),\n",
       " ('以', 5615),\n",
       " ('5', 5569),\n",
       " ('工作', 5135),\n",
       " ('没有', 5000),\n",
       " ('美国', 4840),\n",
       " ('下', 4741),\n",
       " ('更', 4739),\n",
       " ('通过', 4720),\n",
       " ('大', 4704),\n",
       " ('让', 4701),\n",
       " ('可以', 4681),\n",
       " ('经济', 4670),\n",
       " ('时', 4654),\n",
       " ('目前', 4645),\n",
       " ('国家', 4628),\n",
       " ('项目', 4538),\n",
       " ('问题', 4422),\n",
       " ('创新', 4416),\n",
       " ('多', 4410),\n",
       " ('已经', 4391),\n",
       " ('建设', 4373),\n",
       " ('其', 4224),\n",
       " ('自己', 4119),\n",
       " ('投资', 4064),\n",
       " ('已', 4026),\n",
       " ('3', 4008),\n",
       " ('城市', 3921),\n",
       " ('服务', 3842),\n",
       " ('报道', 3818),\n",
       " ('亿元', 3813),\n",
       " ('及', 3812),\n",
       " ('1', 3793),\n",
       " ('成为', 3684),\n",
       " ('相关', 3646),\n",
       " ('向', 3603),\n",
       " ('可能', 3595),\n",
       " ('他们', 3560),\n",
       " ('以及', 3475),\n",
       " ('或', 3447),\n",
       " ('今年', 3426),\n",
       " ('地', 3411),\n",
       " ('其中', 3408),\n",
       " ('于', 3371),\n",
       " ('她', 3349),\n",
       " ('能', 3343),\n",
       " ('10', 3330),\n",
       " ('着', 3327),\n",
       " ('2016', 3310),\n",
       " ('认为', 3295),\n",
       " ('20', 3282),\n",
       " ('称', 3271)]"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Count how often each token occurs\n",
    "words_count = Counter(TOKEN)\n",
    "# The 100 most frequent tokens\n",
    "words_count.most_common(100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:32:26.270113Z",
     "start_time": "2020-11-29T08:32:25.796697Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<matplotlib.lines.Line2D at 0x1ff156bbdd8>]"
      ]
     },
     "execution_count": 54,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZEAAAD8CAYAAAC2PJlnAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAIABJREFUeJzt3X2QXNV95vHv0y8zo5cRQmKQQCiWEPhFARxgsB2IN8qbazHEQS7jWA5gHLxiCQUbvIVxkdikWGPsxBUCsWUMdgI4G8gCilm/sFWUExlj/MIIbGzAJOARWAKjEQI0IzSv/ds/7mmpNerbPZoZaaTR86nqutP33Hv7nJlWPzrn3HtbEYGZmdl4FKa6AmZmdvByiJiZ2bg5RMzMbNwcImZmNm4OETMzGzeHiJmZjZtDxMzMxs0hYmZm4+YQMTOzcStNdQX2tSOOOCKWLFky1dUwMzuorF+/fktEdDTbbtqHyJIlS+jq6prqapiZHVQkPTeW7TycZWZm4+YQMTOzcXOImJnZuDlEzMxs3BwiZmY2bg4RMzMbN4eImZmNm0Mkx9pHN/JPPxjTadJmZocsh0iOr//kBf7lkV9OdTXMzA5oDpEc5WKBoZHKVFfDzOyA5hDJ4RAxM2vOIZKjVBTDlZjqapiZHdAcIjlKhQLDIw4RM7NGHCI5Wkpi0MNZZmYNOURyZD0Rh4iZWSNNQ0TSTZJekhSSvpHWrUjPRz9WpPINo9b/uOZ4Z0h6XNKApEclnVJTdo6kZyT1S1onaWlN2cWSNkraIek+SfMn9TcxSqkoD2eZmTUx1p7IXaOePwmsqnlsBgaBJ2q2ebCm/CoASW3AvUA7cAWwALhHUlHSwvQ624ArgVOB29N+JwM3A08B1wBnATfsRTv3WrlYYKjinoiZWSNNv9kwIi6XtAS4vGbdZlKwSDoVOBK4MyJ6anbtBr4ZEb01684kC46PRcSaFByfAFYAJwGtwPURcbek04DzJS0DLkz7Xx0Rj0g6G1glaXVE9O99s5srF8WQeyJmZg1NxpzIf0/LL45afwGwTdJmSRelddXhqU1puTEtjx1HWQlYXK9CklZL6pLU1dPTU2+TpkqFAiOVIMJBYmaWZ0IhImkO2XDVkxHx3ZqiW4H3A+eTDXN9qXZ+o/YQaVnvk3q8ZUTELRHRGRGdHR1Nv2e+rnIxewn3RszM8jUdzmriPGAWo3ohEXFd9ec0n/FR4I1kQ1wAx6TlorTsJpsnySur3e+FVDbMrt7KpCsXs3wdrlRo8UlsZmZ1NQ0RSWcBJ6SniyV9BPhORPwncDHwOvDVmu1PBD4N3J+OfwGwA/gpsJVsEv4SSb3ARcAGYB3ZZP1ngKskLQBWAg9FxLOS7iCbk7lO0gPA6WRzMPtkPgSglEJkaDigZV+9ipnZwW0s/8W+kuzDHbLJ71uBMyS9Iz2/MyJeq9l+C1AErk37PQesjIgX0of+uUAfcCNZoJwbESMR8SLZ0Nhc4HPAY6QJ9YhYD1wKLE/HvZ/s7K59Zudwls/QMjPLNZazs1Y0KNboFSkM3t3geA8CJ+aUrQXW5pStAdY0qutkKhXScJbnRMzMcnmwP8euiXX3RMzM8jhEclQn1h0iZmb5HCI5Sqkn4tvBm5nlc4jkqM6JuCdiZpbPIZKjpeSLDc3MmnGI5Nh1dpZ7ImZmeRwiOUq+7YmZWVMOkRy1tz0xM7P6HCI5fIqvmVlzDpEcpYKHs8zMmnGI5Ng5nOUQMTPL5RDJsetiQw9nmZnlcYjkaEk9kcFhh4iZWR6HSA7f9sTMrDmHSA5fbGhm1pxDJMfO4SxPrJuZ5XKI5Ng5nOWeiJlZLodIDs+JmJk11zREJN0k6SVJIekbNevXpXXVx6s1ZW+R9LCkAUlPS3pXTdkZkh5PZY9KOqWm7BxJz0jqT8dfWlN2saSNknZIuk/S/Mn5FdRX9q3gzcyaGmtP5K6c9U8B
q9LjT2vW3wm8GfgoMATcLekwSW3AvUA7cAWwALhHUlHSwvQ624ArgVOB2wEknQzcnF7vGuAs4IYx1n1cCgVRLMghYmbWQKnZBhFxuaQlwOV1ijcD34yI3uqK9IH/VmBNRHxB0g7gK8D7gK1kwfGxiFiTguMTwArgJKAVuD4i7pZ0GnC+pGXAhenwV0fEI5LOBlZJWh0R/eNo95iUCvIV62ZmDUx0TuS/ANskbZP0F2lddQhqU1puTMtjJ7msBCyeYP0bKhcLvneWmVkDEwmRe4HzgHOBXwKfkvTOOtspLet9Gu+LMiStltQlqaunp6feJmNSKsq3PTEza6DpcFaeiPj76s+SjgJuApYDP0qrj0nLRWnZTTaclVfW3qCsu6bshVQ2zK7eyui63QLcAtDZ2TnurkTWE3GImJnlaRoiks4CTkhPF0v6CPAIWWj8K7AD+HOgAjwSEY9Jehz4gKQngEuAXrKeSz/ZPMolknqBi4ANwDrgSeAzwFWSFgArgYci4llJd5DNyVwn6QHgdODOfTkfAlAuyMNZZmYNjGU460qyD3fIJr9vBX4T6AE+DvwdWThcEBGPpu0+CDwN/C3QArw/Il5NH/rnAn3AjWSBcm5EjETEi2Rnec0FPgc8RppQj4j1wKVkPZ1rgfvJzu7ap0rFgi82NDNrYCxnZ63IKbq5wT5PkAVNvbIHgRNzytYCa3PK1gBrGtV1spWKYsgXG5qZ5fIV6w20FAsM+VbwZma5HCINZGdnuSdiZpbHIdJAqeCzs8zMGnGINNDiU3zNzBpyiDRQKvq2J2ZmjThEGigVCz47y8ysAYdIA+WCfJ2ImVkDDpEGfNsTM7PGHCINeE7EzKwxh0gD5WKBId/F18wsl0OkAX8plZlZYw6RBsolz4mYmTXiEGnAt4I3M2vMIdKAbwVvZtaYQ6SBUtE9ETOzRhwiDbT47Cwzs4YcIg2UCgUiYMS3PjEzq8sh0kCpKACfoWVmlqNpiEi6SdJLkkLSN9K6eZK+lda/Lun7kk6t2WdD2r76+HFN2RmSHpc0IOlRSafUlJ0j6RlJ/ZLWSVpaU3axpI2Sdki6T9L8yfs11Fd2iJiZNTTWnshdo57PARYBnwE+C7wduGfUNg8Cq9LjKgBJbcC9QDtwBbAAuEdSUdLC9DrbgCuBU4Hb034nk32n+1PANcBZwA1jbeR4lYvZr8cXHJqZ1VdqtkFEXC5pCXB5zeqNwMkRUQGQ9B7gFEkzI+L1tE038M2I6K3Z70yy4PhYRKxJwfEJYAVwEtAKXB8Rd0s6DThf0jLgwrT/1RHxiKSzgVWSVkdE/3gaPhalFCKeXDczq29ccyIRMVwTIG8A3gysrwkQgAuAbZI2S7oorasOT21Ky41peew4ykrA4vHUf6zKhWw4yz0RM7P6JjSxnnoS3wIGgA/VFN0KvB84HxgEvlQ7v1F7iLSs9yk93jIkrZbUJamrp6encSMaqA5neU7EzKy+psNZeSQdDfwbcCTwroh4oloWEdfVbHcy8FHgjWRDXADHpOWitOwmmyfJK6vd74VUNsyu3spuIuIW4BaAzs7OcXcjdp2d5Z6ImVk9TUNE0lnACenpYkkfAX4IrAWOA/4GOE7SccDXyYafPg3cn45/AbAD+CmwFdgMXCKpF7gI2ACsA54km6i/StICYCXwUEQ8K+kOsjmZ6yQ9AJwO3Lkv50OgZmLdcyJmZnWNZTjrSrIPd8gmv28lO3PquJryO9OjA9gCFIFr037PASsj4oX0oX8u0AfcSBYo50bESES8SHYm11zgc8BjpAn1iFgPXAosT8e9n+zsrn2q5DkRM7OGxnJ21oqcotsa7PbuBsd7EDgxp2wtWQ+nXtkaYE2D15x05VKWsYOeEzEzq8tXrDdQLvg6ETOzRhwiDVQn1n07eDOz+hwiDVRve+LhLDOz+hwiDfi2J2ZmjTlEGigVfIqvmVkjDpEGyr7Y0MysIYdIAyXf9sTMrCGHSAPloi82NDNrxCHS
QNm3gjcza8gh0oBve2Jm1phDpIHqbU88J2JmVp9DpIHqbU98dpaZWX0OkQZ82xMzs8YcIg1U50SGKu6JmJnV4xBpQBLlojwnYmaWwyHSRKlQ8HCWmVkOh0gTpaI8sW5mlsMh0kS5WPBwlplZDodIE+WifLGhmVmOMYWIpJskvSQpJH2jZv1bJD0saUDS05LeVVN2hqTHU9mjkk6pKTtH0jOS+iWtk7S0puxiSRsl7ZB0n6T5NWXXSOqR1CfpNkltE/8VNFYqFHzbEzOzHHvTE7mrzro7gTcDHwWGgLslHZY+3O8F2oErgAXAPZKKkhamY20DrgROBW4HkHQycDPwFHANcBZwQypbCfwV8G3gJuBDwNV7Uf9xcU/EzCzfmEIkIi4nfZhXpQ/8twJ3RsQXgL8F5gDvA84kC441EbEG+AqwFFgBrAJagesj4u+BfwXeKWkZcGE6/NUR8dfAw8CqFErVsssi4mrgl8CH977Je6fkOREzs1wTmROpDkFtSsuNaXnsJJeVgMWpbCgiemrKFklqGV0xSasldUnq6unpGV28V7KJdfdEzMzqmcyJdaVlvU/cfVm2h4i4JSI6I6Kzo6Mjb7MxKRflr8c1M8sxkRDpTstj0nJRzfrJLBsm63V0A2VJR9aUbYqIwQm0oalSwXMiZmZ5SmPZSNJZwAnp6WJJHwG+AzwOfEDSE8AlQC/ZhHo/sBm4RFIvcBGwAVgHPAl8BrhK0gJgJfBQRDwr6Q7gcuA6SQ8Ap5PNufRLuh14D3CjpG6yIa5PTbD9TZWKBQY9J2JmVtdYeyJXkn3wA5wE3AqcAXwQeJpsUr0FeH9EvBoR/cC5QB9wI1mgnBsRIxHxItnk+lzgc8BjpEnziFgPXAosB64F7ic7u4uIWJvW/QFZ0HwV+PQ42z1mLUXf9sTMLM+YeiIRsaJB8W/m7PMgcGJO2VpgbU7ZGmBNTtk1ZKf+7jelohj2XXzNzOryFetNlAo+O8vMLI9DpImWkm8Fb2aWxyHShG8Fb2aWzyHShG8Fb2aWzyHSRLng256YmeVxiDRRLvnsLDOzPA6RJkruiZiZ5XKINOFbwZuZ5XOINOFbwZuZ5XOINFEuFhiuBBHujZiZjeYQaaJcyO4478l1M7M9OUSaKBWzX5HnRczM9uQQaaJczHoivh28mdmeHCJNlHf2RBwiZmajOUSaKBU9J2Jmlsch0kS5kP2KfJqvmdmeHCJNVHsivgmjmdmeHCJNeE7EzCzfhEJE0oWSos5jSZ11X6vZ7xxJz0jql7RO0tKasoslbZS0Q9J9kubXlF0jqUdSn6TbJLVNpP5jUXZPxMws10R7It8BVqXH+cAg8BKwKZXfW1P+OQBJC4G7gG3AlcCpwO2p7GTgZuApsu9SPwu4IZWtBP4K+DZwE/Ah4OoJ1r+pkudEzMxylSayc0R0A90Akt4HtAD/EBFDkgCeBL4eEdtrdlsFtALXR8Tdkk4Dzpe0DLgwbXN1RDwi6WxglaTVNWWXRUSPpPOADwOfnEgbmimX0nBWxSFiZjbaZM6JXAxUgFtq1v0l0CfpuRQIANWhq2pvZWNaHptTVgIWp7KhiOipKVskqWUS27CH6m1PPJxlZranSQmR1Iv4PeD/RcSGtPqzwHuB1cDhwJ2SZtbbPS3rfUqPpaxefVZL6pLU1dPTk7fZmPi2J2Zm+SarJ3Ix2Yf6F6srIuLjEfG1iLgVeACYTdaj6E6bHJOWi9KyO6dsmKzX0Q2UJR1ZU7YpIgZHVyYibomIzojo7OjomFDDdp3i6+EsM7PRJjQnApCGky4Enge+lda9GzgPWEfWCzkT6CELgruAzwBXSVoArAQeiohnJd0BXA5cJ+kB4HTgzojol3Q78B7gRkndZIH0qYnWv5mWoifWzczyTEZP5L1AB3BrRFQ/aZ8DjgL+mmxepAs4KyIGI+JFssn1uWRnbD1GmjSPiPXApcBy4FrgfuCKVLY2rfsDsqD5KvDpSah/Q77tiZlZvgn3
RCLiLrLeRe26J4DfabDPWmBtTtkaYE1O2TVkp/7uNz7F18wsn69Yb8IXG5qZ5XOINOHbnpiZ5XOINLHz7CzPiZiZ7cEh0kT1VvDuiZiZ7ckh0oSvEzEzy+cQaaK88zoRD2eZmY3mEGmi7NuemJnlcog0USwIycNZZmb1OETGoFwsMORbwZuZ7cEhMgblgjycZWZWh0NkDErFgk/xNTOrwyEyBuWiGHRPxMxsDw6RMSi7J2JmVpdDZAxKRflW8GZmdThExqBcKPgUXzOzOhwiY1AqyiFiZlaHQ2QMsjkRD2eZmY3mEBmDUrHgW8GbmdUx4RCRtEFS1Dx+nNafIelxSQOSHpV0Ss0+50h6RlK/pHWSltaUXSxpo6Qdku6TNL+m7BpJPZL6JN0mqW2i9R+LckEMDXs4y8xstMnqiTwIrEqPq9KH+71AO3AFsAC4R1JR0kKy72TfBlwJnArcDiDpZOBm4Cmy71I/C7ghla0E/gr4NnAT8CHg6kmqf0PZ2VkOETOz0UqTdJxu4JsR0Qs7P/AXAB+LiDUpOD4BrABOAlqB6yPibkmnAedLWgZcmI53dUQ8IulsYJWk1TVll0VEj6TzgA8Dn5ykNuQqFwv0Dw3v65cxMzvoTFZP5AJgm6TNki4CqsNTm9JyY1oeO46yErA4lQ1FRE9N2SJJLZPUhlzlYsE9ETOzOiYjRG4F3g+cDwwCXwI0apvq83qz0xMt27NAWi2pS1JXT09P3mZjViqIoWFPrJuZjTbhEImI6yLinoj4J+BfgCK7ehfHpOWitOxOj70pG07H6wbKko6sKdsUEYN16nRLRHRGRGdHR8eE2ge+FbyZWZ4JzYlIOhH4NHB/OtYFwA7gu8Bm4BJJvcBFwAZgHfAk8BmyCfgFwErgoYh4VtIdwOXAdZIeAE4H7oyIfkm3A+8BbpTUTTbE9amJ1H+sWkoFBoYcImZmo020J7KFrOdxLVkwPAesjIgXgHOBPuBGskA5NyJGIuJFsrO45gKfAx4jTZpHxHrgUmB5Oub9ZGd3ERFr07o/IAuar5IF2D73xgXtbHp1B5u39e+PlzMzO2goYnqP9Xd2dkZXV9eEjvHTja/xh59/iL/749/gnJMXNd/BzOwgJ2l9RHQ2285XrI/B8qPnMHdmmYee2TLVVTEzO6A4RMagWBCnL5vPw89sYbr33MzM9oZDZIxOX3YEL7zWT/eW7VNdFTOzA4ZDZIx+67gjAPieh7TMzHZyiIzRG+bPZNHcGXzvmZenuipmZgcMh8gYSeK3jjuCh5/dwohvC29mBjhE9srpx81nW/8wT7zw2lRXxczsgOAQ2QunL8vmRXyqr5lZxiGyFzraW3nzwnZPrpuZJQ6RvfTbb+zgR91befG1HVNdFTOzKecQ2UvnveMNVAK+/N3u5hubmU1zDpG9tHjeTP7oN47mn3/4PFu373EXejOzQ4pDZBz+bMUy+odHuO177o2Y2aHNITIOxx3ZzruWL+C2hzfQ2z801dUxM5syDpFx+rMVx7Gtf5j//cPnp7oqZmZTZkLfbHgoe+viubzz+CO46dv/SdeGrSw/ag6nLZ3HO4+f+NfxmpkdLNwTmYBPnXMC71q+gOe3vs4X1j3L+V/5Eb/c+vpUV8vMbL9xT2QC3jB/Fn/3gZMBeOrFbZx543f5wS9eZvG8mVNcMzOz/cM9kUnypgXtHD6zzA+7t051VczM9psJhYik4yX9u6SXJfVKekDSslQWox5fq9nvHEnPSOqXtE7S0pqyiyVtlLRD0n2S5teUXSOpR1KfpNsktU2k/pOpUBCnLZnHjxwiZnYImWhPZFE6xjXAPwK/D3y5pvxeYFV6fA5A0kLgLmAbcCVwKnB7KjsZuBl4Kh3zLOCGVLYS+Cvg28BNwIeAqydY/0n19mPn8/zW131LFDM7ZEx0TuThiPjt6hNJfwL8ek35k8DXI6L2O2VXAa3A9RFxt6TTgPNTD+bCtM3VEfGIpLOBVZJW15RdFhE9ks4D
Pgx8coJtmDRvXzoPgB91b+WPfmPRFNfGzGzfm1BPJCJ23vdDUicwD3iwZpO/BPokPZcCAaA6dLUpLTem5bE5ZSVgcSobioiemrJFklpG10vSakldkrp6enpGF+8zbzlqDrNbS54XMbNDxqRMrEt6E3AfsAG4LK3+LPBeYDVwOHCnpHqnLSkt631d4FjK9hARt0REZ0R0dnTsv+s2igXRueRwz4uY2SFjwiEiaTnwHWAY+N2IeBEgIj4eEV+LiFuBB4DZZD2K6g2njknL6rhPd07ZMFmvoxsoSzqypmxTbW/oQPD2pfN5ZnMfW/oGproqZmb73ETPzloMrAOOAL4IvF3SByS9W9I/p2Glq4AzgR6yILgLGASuknQZsBJ4KCKeBe5Ih75O0seA04G7IqKfNPkO3Cjp02SBdNtE6r8vvC3Nizzi3oiZHQImOrG+DKiOF11fs/4E4Cjgr4Ei0AX8z9RreFHSKuBvyM7Y+iHZBDkRsV7SpcBfAO8E7geuSGVrJV0LXAq0AV8FPj3B+k+6ExcdRlu5wA+7t3LmiUdNdXXMzPYpRdSbbpg+Ojs7o6ura7++5p98+Qe8sn2Ib/2Pd+7X1zUzmyyS1kdEZ7PtfMX6PvC2JfN56lfbfL2ImU17DpF94KyTFjKjXOSDt/6QF151kJjZ9OUQ2QeOO7Kdr170Nrb0DvD+L33fd/Y1s2nLcyL70OMbX+X8r/yIclG85ag5SKK1VODMExZy9klH01JyhpvZgWmscyIOkX3syRe2cf39T9E3MEwl4OW+ATa+soMFc1q54DeXcPKvzeXI9lY6Zrcxq7VIqehgMbOp5xBJpjpERqtUgu/8Zw9f+W43Dz2zZY/yclG0lYvMm9XCgvY2Oua0cvLiufzhW49mwZwD5qbFZjbNOUSSAy1Eav1y6+tsfGUHPX0D9PQOsH1gmB1DI+wYHGHr9kFe2tbPi6/18/zW15HgHUvns+JNHbzlqDm8+ah2Oma3IuXe/cXMbNzGGiL+ZsMptHjezDF9C+KzPX383x+/wNd/8gLX3//znetnthSZP7uFI2a38uaFczj/HW9g+dFz9mWVzcx2457IQeaV7YP8/Fe9PPXiNja9uoOX+wbo6Rtg/XOv0D9U4W1L53H2SUcxq6VES6nA3Jlllh81h/mzW6e66mZ2EPFwVjLdQiTPa68P8X+6fsnt39/Axlf2vDbl6MPaeMtRc1h4WBsL5rTR0d7KnLYy7W0lZreVKBVEQaJUFL82byYzW9xJNTuUOUSSQyVEqkYqwebefgaHKwwOV+jpHeBnL7zGzzZt4z9e6mVz7wBbtze/8fExh8/g+CNnM2dGmZZigbZykTcubKfzDYfzxgXtFAueizGbzjwncogqFsRRh83Y+fz4Be2cftwRu20zMDzCy32D9PYP09s/RN/AMCOVoBLQPzRC95bt/MdLvfyiZzvdW7YzOFyhb2CYbf3DALS3llh0+AwOn9nC4bPKtJWLlAqiVCwwq6XI3JktHDajzLxZLcyf1cL82a0sPKyN2a1+u5lNN/5XfQhqLRU5eu6M5hvWiAg2vrKDRzZs5bHnX+VX2/p5ZfsgT/+ql/6hCiOVYLhSYfvACDuGRuoeY+7MMosPn8mR7a20lAqUiwVaSgVaS9mypVSgpbj7+rZykbZygZZicee6ebNa6GhvZd6sFsq+rsZsSjlEbEwk7Tyb7L2nHNNw2/6hEV7bMcTW7YO83DfIlr4BfrWtn19ufZ3nt77Oi6/1MzRSYWikwkAadhsczn4eqlTYmxHW2a0l2tuyx+zWErNas2VrCqlyqcCctjJHzG5h3qwW5s4sM7s1mwtqLRUoprmgQkEUJQoFKBey/UoFUSpo52nUAgoexjPbjUPEJl3WeyiO++LIkUowOFyhf2iE/uHsupmhkbQuDcVt6RtgS98A23ZkQ3K9/cP0DQzT2z/Mr17rzwJpJAun13YMMVyZvLm/YkEUC6KchvBaSwU62ltZMKeNI2a3pOG9
AuXirgAqKOsBzmwpMqOlSEsxC7BSUSyc08ayI2czf1aLr/uxg45DxA44xYKYkT5sJ0NEsK1/mK3bB3ltxxB9aS5oYLiS5oKyx0gFRiIYHqkwPBIMjmTlVZUIKpXItqkEIyPZcsfgCD19A7y0rZ+fbXqNwZr9swpk+zYLsjltJTraW5nVWmJmSzH1srJe04yWIjNSONf2jiALqGovSWS9RgmyZ6SfoZB+2LkNUCik9Wkd7NpWyo6dDSMWdw4vzmjJwrBU2DWUWK1O9diF9PpKxxfsHKa06cUhYtOeJA6bUeawGeUprcfwSGXnHQmGKllYDY1U2PRqP89u7uMXW/p45fUhtg8Ms31gmE2v9tPb30tv/zA7Bkd2hdJBrKVYYHZbFpIzUiC1lYqUilmvrpR6eUVlyyzIsp9b6s2hpWHHgrQz+Kpq+3TVYEOiJd1aaEY5m2er9ixLhdQ7LOx+LInd6tRaLu6sQ7HmdQvpNarPD5VepUPEbD8pFQu0Fwu0t+0eZscd2c5vv7EjZ69dRirBwHA2tLdTQJCdWVeJINLz2nmlXeuze7cBu22X9cR2HXDnthFUKuycu+ofyk6a6B8a4fXBkV09q/Ri1UNUKpH2z3qB1dcYGqnQNzBC38AQ2wey42SPLEyzY1ayHmEl6wVGOvxIJdu/f2hk5zzaZA5R7ivVXmBVsSYkCzvDb/fASVkHaLf9q0G1W28z9fiqx0i77dzmHz50Gr82v/ldMSbioAoRSWcAXwTeBDwBfCQiHp3aWpntH8WCfBFojercWSWyIcZKTaiMPjljVygGQ2kIshpI2VBmMDySHWckBVn1GrpqQFfPQKw9EaQawNVhz+pwZzWIa6/Di92OEzvDvXbINKjdt1pz0rF3/w8DNa+xW2jX7NZa3vfDhwfNO1JSG3AvsAO4AvgL4B5Jx0dE/XNKzWzaqs6d2dQ6mGa5zgQWAGsiYg3wFWApsGIqK2Vmdig7mEJkaVpuSsuNaXns6A0lrZbUJamrp6dnv1TOzOxQdDCFyGjV+aY9Ztci4paI6IyIzo6O5hOWZmY2PgdTiHSnZfVy6UWj1puZ2X520EysA/cDm4FLJPUCFwEbgHVTWCczs0PaQdMTiYh+4FygD7iRLFDO9ZlZZma6VdoTAAAECUlEQVRT52DqiRARDwInTnU9zMwsc9D0RMzM7MAz7b/ZUFIP8Nw4dz8C2DKJ1TlYHIrtPhTbDIdmu93msXlDRDQ9vXXah8hESOoay9dDTjeHYrsPxTbDodlut3lyeTjLzMzGzSFiZmbj5hBp7JaprsAUORTbfSi2GQ7NdrvNk8hzImZmNm7uiZiZ2bg5ROqQdIakxyUNSHpU0ilTXafJJul4Sf8u6WVJvZIekLQslZ0j6RlJ/ZLWSVra7HgHG0ltkp6WFJI+n9a9RdLD6e/+tKR3TXU9J4ukuZLukPSqpD5JD6b10/q9LunPJW1I7euWdFlaP23aLekmSS+l9/I3atbnvp8ns/0OkVFqvvyqnezLrxaQffnVdPv2m0Vkf/9rgH8Efh/4sqSFwF3ANuBK4FTg9qmq5D70SXbdzLPqTuDNwEeBIeBuSYft74rtI/8A/AnZ9/D8OfDMdH+vSzoeuAGokP1Ny8BNkhYz/dp9V511dd/Pk/53z74D2Y/qA1hJdnv5K9Pza9Pz35vquk1yO1tGPX+Z7H5kV6T2npvW35GeL5vqOk9i208i+4bMK1PbPg+cnH7+QtrmT9Pzi6a6vpPQ3mNTW/4JaAGKaf20fq+TfY12AN9NP3cB1XvwTat2A0tSG76Rnue+nyf77+6eyJ7G/OVXB7OIGKz+LKkTmAc8yDRvv6QC8GXgC8AjNUXTud3L0/I0YDuwXdJnmd5tJiKeBj4OnAH8nOyDdTWwOG0yLdudNPrbTurf3SHSXO6XX00Hkt4E3Ed2W/3L6m2SltOl/R8m+1/bHez6TprDyIY6ak2ndrem5Szgj4HvAR9jzxuwTqc2I6mD7D39Y+Ac
4Cdkvc7ZozdNy2nR7hyN2jih9h9Ud/HdTw6ZL7+StBz4N2AA+N2IeFHSdG//YqCD7AOl6jzg6PTzdGz3hrT8bkSsTR+uv8uuD4/p2GaA3yFr080RcZ+kE4H/BTyVyqdru6Hx59jWBmV7b6rH8g60B9AGvJR+oZeQdfm6SePI0+VB9mG6GRgm6/J/ID2OIguV9WT/i+sl+/CZ8jpPUruXA+9Lj2vI/vd1P9kJBD9J/8AuBX5GdnLB3Kmu8yS0WcDj6e/934AfpL/7CdP5vQ50pr/vz8nmAp5Kz986ndoNnAVcldr2E+AjwPF57+fJ/oyb8l/AgfgA/gvwU2AQeAzonOo67YM2rkhvut0eqey9wLMpTB5kGk2q5/wOPp+e/zrw/dTu/wD+61TXcRLbWm1bf2rbB9P6af1eJzszqTu1+xfApdOt3WTf7jr63/KFjd7Pk9l+X7FuZmbj5ol1MzMbN4eImZmNm0PEzMzGzSFiZmbj5hAxM7Nxc4iYmdm4OUTMzGzcHCJmZjZu/x8fX3+TU7u3nwAAAABJRU5ErkJggg==\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
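   ],
   "source": [
    "# Plot the high-frequency tokens\n",
    "\n",
    "# y-axis: frequencies of the 100 most common tokens\n",
    "frequiences = [f for w, f in words_count.most_common(100)]\n",
    "# x-axis: rank of each token (0 to 99)\n",
    "x = list(range(100))\n",
    "# Draw the rank-frequency curve\n",
    "plt.plot(x, frequiences)"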
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:32:45.107441Z",
     "start_time": "2020-11-29T08:32:44.755822Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<matplotlib.lines.Line2D at 0x1ff15745a90>]"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAD8CAYAAAB+UHOxAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvnQurowAAIABJREFUeJzt3Xl4HNWZ7/Hv291qSa19ty3JyCu2MRhjASEsJkCSCRBCBkIw4bKFcZIhMEwy3GwzN5mQe7MxkDAhTwa4CTABk7AmMWRCYDCEHXnD2I6N8b5Jso21L93qM390yQhZkiW3pJa6f5/n6ae6qk9Vvcdq69WpU3WOOecQEZHU40t0ACIikhhKACIiKUoJQEQkRSkBiIikKCUAEZEUpQQgIpKilABERFKUEoCISIpSAhARSVGBRAcwkOLiYldVVZXoMERExpXly5fvc86VHKncmE4AVVVV1NTUJDoMEZFxxcy2DaacLgGJiKQoJQARkRSlBCAikqKUAEREUpQSgIhIilICEBFJUUoAIiIpKikTwO6Dbdz+zAa27GtJdCgiImNWUiaA/c2d3Pnfm9hU15zoUERExqykTAChdD8ArZ2RBEciIjJ2JWUCyArGRrho6ehKcCQiImPXoBKAmd1pZrVm5sxsqbet0Mye9ra3mtmrZrZggGNcbGabzKzdzJaZ2ZThqkRvmUG1AEREjmQoLYCHe63nAuXAD4AfAqcCj/a1o5lN8PZvBG4BFgD3DzXYwQodSgBqAYiI9GdQo4E6524ysyrgph6bdwLznXNRADO7CDjJzELOudZeh1gEpAPfd849YmYnA//LzKY5596NtxK9pfl9BAM+WtQCEBHp11H3ATjnIj1++R8DzAKW9/HLH6D7cs8ub7nTW07tXdDMFptZjZnV1NfXH214ZAX9tKoPQESkX3F3AnuXd54GOoCrB7ubt3S9P3DO3e2cq3bOVZeUHHE+g36FggG1AEREBhBXAjCzScAyYCLwMefc2h6fZZhZ0Fvd4i0rvGV5r+3DLhT006Y+ABGRfg32LqALgM96q5Vmdr2ZHQ+8ABwL3AtMN7PLzSzLK9cGrPDePwx0Al8zsxuBTwMvjcT1/26h9AAtSgAiIv0a7JSQtwALvfcnAPcA1wLTe3zebQrwgTEYnHN7zGwR8GPgNuB1b/8RE+sD0CUgEZH+DPYuoLP7+ei+AfaxXuuPA48PNrB4hYIB3mttG63TiYiMO0n5JDBAVrpfD4KJiAwgaRNAKBjQg2AiIgNI4gSgPgARkYEkbQLICvppDXcRjR72qIGIiJDECSCUHsA5aI/oMpCISF+SNgFkeQPCaUhoEZG+JW0CCHlzAuhOIBGRviVxAtCQ0CIiA0neBJCuFoCIyECSNgGoD0BEZGBJmwDUByAiMrCkTQBZ6WoBiIgMJGkTwKEWQFgJQESkL0mcALy7gDQchIhIn5I2AWSmeZeAdBuoiEifjpgAzOxOM6s1M2dmS3ts/4aZ7fC2v32EYyzzynW/Dg5H8APx+UwDwomIDGCwLYCH+9iWBvznEM61Hljkva4bwn5HLTYxvFoAIiJ9OeKMYM65m8ysCrip1/bvQqwlMMhz1QFPOeeahhjjUdOkMCIi/RvNPoCzgEYzazSzb43GCTPT/BoKQkSkH6OVAB4DrgQ+A+wAvmdmZ/ZV0MwWm1mNmdXU19fHddKs9IBaACIi/RiRBGBmQTPL6F53zv27c+4h59yjwC+8zXP62tc5d7dzrto5V11SUhJXHKGgXw+CiYj044h9AGZ2ATDXW600s+uBF4CJwExve4G3fYVzbgXwDLDQzEqAMPB74AmgDbgZiAJvDmdF+pIVDFDb2D7SpxERGZeOmACAW4CF3vsTgHuAa4Gzgau97ZO87f8KrOi1fwdQD3wdyAM2A1d5iWJEhdLVAhAR6c9g7gI6u5+P7gOuGeQ+lw4hpmGTFQzQpqEgRET6
lLRPAkN3H4A6gUVE+pLkCSBARyRKpCua6FBERMacpE4A3UNCa0RQEZHDJXUCODQktDqCRUQOk9QJ4NCkMHoYTETkMEmdALqHhG7TcBAiIodJ6gSQlR67BKQ7gUREDpfUCeDQrGBqAYiIHCapE8ChFoD6AEREDpPUCeD9eYHVAhAR6S2pE0BW922gagGIiBwmqRNAZlATw4uI9CepE0B6wIffZ2oBiIj0IakTgJlpUhgRkX4kdQKAWD+AWgAiIodL+gQQSverD0BEpA+DSgBmdqeZ1ZqZM7OlPbZ/w8x2eNvfPsIxZpvZK2bWYWYbzOxj8QY/GKGgX0NBiIj0YSgtgIf72JYG/Ocg918CzAK+Qmye4EfMLG8I5z8qoWBAQ0GIiPRhUAnAOXcTcEcf27/rnPvmkfY3s/nAPGCJc+4u4HYgl1GYKjIr6NdQECIifRitPoAp3nKXt9zpLaf2Lmhmi82sxsxq6uvr4z5xKD2goSBERPqQqE5g85au9wfOubudc9XOueqSkpK4T5QV9GsoCBGRPoxYAjCzoJlleKtbvGWFtyzvtX3EhHQbqIhInwKDKWRmFwBzvdVKM7seeAGYCMz0thd421c451YAzwALzazEObfSzN4CLjeztcCXgCbgsWGsS59CXh+Acw4zO/IOIiIpYrAtgFuAH3jvTwDuAU4HrvPeA0zy3l/UzzGuADYQ6wAOApc55w4eRcxDkpUeIBJ1dHZFR/pUIiLjyqBaAM65s/v56D7gmsHs45xbC5w26MiGSc8hodMD/tE+vYjImJX0TwJ3DwmtO4FERD4o6RNAKF3TQoqI9CX5E4DmBRYR6VMKJABvVjANByEi8gFJnwDe7wNQC0BEpKekTwDdfQDNHeEERyIiMrYkfQIoz88kGPCxdldjokMRERlTkj4BZKT5mV+Zz+tbDiQ6FBGRMSXpEwDAqVOLWLu7gcZ2XQYSEemWEgngQ1MKiTpYvvW9RIciIjJmpEQCmD+5gDS/8dqW/YkORURkzEiJBJAZ9DOvIp/XN6sfQESkW0okAIBTpxayZleD5gcWEfGkTgKYUkRX1LF8m/oBREQghRLAgmMK8PuM19UPICICDDIBmNmdZlZrZs7MlvbYPtvMXjGzDjPbYGYfG+AYrtfryeGowGBlpQc4vjxP/QAiIp6htAAe7mPbEmAW8BUgDDxiZnkDHOMxYJH3um0I5x4Wp04tZPXOg7RpXCARkcElAOfcTcAdPbeZ2XxgHrDEOXcXsakec4FLBzjUOuAPzrmHnXMvHV3IR+9DU4oIdzlWblc/gIhIPH0AU7zlLm+501tOHWCffwaazWybmV0Yx7mPSnVV7HmA5zfUjfapRUTGnOHsBDZv6fr5/IfA3wKLgQJgiZmFDjuI2WIzqzGzmvr6+mEMD3Iy0lg4s5Tfr95NV7S/MEVEUkM8CWCLt6zwluU9t5tZhpkFuws7577unHvSOXcP8GcgG6jsfVDn3N3OuWrnXHVJSUkc4fXt4vmTqG3s4LXNuhtIRFJbYDCFzOwCYK63Wmlm1wMvAG8Bl5vZWuBLQBOxjl6ANmAtMNfMzgeuBJYR++v/E0A97yeRUXPe7DKy0wM8uXIXp08vHu3Ti4iMGYNtAdwC/MB7fwJwD3A6cAWwgVgHcBC4zDl3sI/9twETgR8R6weoAS5wznUefehHJyPNz9/MncB/vb2X9rDuBhKR1DWoFoBz7uwBPj6tn32sx/u1wEeGFNkIuvjEch5dvpPn1tdxwQkTEx2OiEhCpMyTwD2dNq2I0px0nly168iFRUSSVEomAL/PuGjeJJZtqONg66hfhRIRGRNSMgEAXDy/nHCX4+bfrOKJlTupa2pPdEgiIqNqUH0Ayei4Sbl84aypPLJ8J8s2xJ43+JcL5/D5M6YcYU8RkeSQsi0AM+Mb58+m5lvnsfTGMyjPz+TVd/VsgIikjpRNAN18PmNueR6zJ+aw873WRIcjIjJqUj4BdKsoCLHjQCvOaYgIEUkN
SgCeysIQLZ1dHGjRXUEikhqUADyTC2Pj0u14ry3BkYiIjA4lAE9lYSYAOw6oH0BEUoMSgKeyINYC2K4EICIpQgnAk5UeoCgrqDuBRCRlKAH0UFEYUgtARFKGEkAPlQWZ7DigTmARSQ1KAD1MLgyx+2Abka5ookMRERlxSgA9VBaGiEQdexo0MJyIJL9BJQAzu9PMas3MmdnSHttnm9krZtZhZhvM7GMDHONiM9tkZu1mtszMxtyoa+8/C6B+ABFJfkNpATzcx7YlwCzgK0AYeMTM8noXMrMJ3v6NxKaXXADcP+RoR1j3raA71Q8gIilgUAnAOXcTcEfPbWY2H5gHLHHO3UVsXuBc4NI+DrEISAe+75z7d+AJ4EwzmxZH7MNuYn4GPtOzACKSGuLpA+i+hNM9r+JObzk1nrJmttjMasyspr6+Po7whi7N72NSfqYuAYlIShjOTuDuSeAHM5xmv2Wdc3c756qdc9UlJSXDFtxgVXqjgoqIJLt4EsAWb1nhLct7bjezDDMLDqbsWFJZmMl29QGISAoY7F1AFwCf9VYrzex6oBl4C7jczG4g1hHcBDzmlWsDVnjvHwY6ga+Z2Y3Ap4GXnHPvDksthtHkwhD7mjto6+xKdCgiIiNqsC2AW4AfeO9PAO4BTgeuADYQ6wAOApc55w723tk5t4dYR3A+cBuwErgmnsBHSqV3K6jGBBKRZDeoSeGdc2cP8PFp/exjvdYfBx4fdGQJUtFjVNAZZTkJjkZEZOToSeBeDj0Mpo5gEUlySgC9FGcHyUzza2YwEUl6SgC9mBlTS7JYuf29RIciIjKilAD68KkTJ7Fi+0E21jYlOhQRkRGjBNCHS06qIM1vLHlje6JDEREZMUoAfSjKTufjx03gseU7aQ/reQARSU5KAP244tTJNLZHeHrNnkSHIiIyIpQA+nHa1CKqikI89LouA4lIclIC6IeZseiUydRse0+dwSKSlAb1JHCqunRBBbc9s4Gv/HYVM8tyCPp9nD69mE/Om5To0ERE4qYEMICi7HS+uHAaT721hze2HKC5I8JvanaQm5nGwpmjP1S1iMhwMucGM3x/YlRXV7uamppEh3FIW2cXn/75y9Q2tvPUTWcyKT8z0SGJiBzGzJY756qPVE59AEOQGfTz88+dRLjLccNDK+iMRBMdkojIUVMCGKKpJdn88JITWLn9IF99ZDWrdhwkGh27rSgRkf6oD+AoXHDCRNbtmcbPl73LH1bvpigryIemFlFZGKK8IJNpxVlUVxUSDCi/isjYFXcfgJldB3wLmAS8CFznnNvVq0wVh0//+FPn3M0DHXus9QH0dqClkxc21vH8X+tZteMgexraCHfF/j1z0gN8ZFYpn5w3iY/OKUtwpCKSSgbbBxBXAjCzauAN4CXgUeDHwJ+ccxf1KldFLAH8AnjB27zBObdyoOOP9QTQW1fUUd/Uwdu7GvjzulqeXV/L/pZOvv3JOVx7+pREhyciKWKwCSDeS0ALAQP+wzn3oJktAi40syLn3P4+ytcAv3fOJeVsK36fMSEvgwl5GZw3p4xIV5S/f3AFty5dR2VBiPPUEhCRMSTei9R13vIMM5sFzCCWEKr6KX8P0GJma83sQ3Gee8wL+H385PITmVuex41LVrJmZ0OiQxIROSTeBPBb4GXgi8B6YhPDA7T3KtcCfBu4GPgnYCbwYF8HNLPFZlZjZjX19fVxhpd4oWCAe6+upjAryHX3v8lrm/tqGImIjL7h6AT2AccDEeAnwBlAIeCALudcuI99lgMnAZnOud7J4pDx1gcwkHdqm7ju/jfZcaCNy6or+Ob5s8kPBY+8o4jIEI1KH4CZ+YHbgZXAycB53noZsU7fp4j1Cfyd9/nrwBTgRGD1QL/8k82MshyeuXkhP3luI/f+ZQvPrq/jonmT+PhxEzi5qoCAX7eMisjoivcuIB+wAphF7DLPQ8Qu8UzESwDOuQvNbCHwQ+A4Yi2FV4CbnXPvDHT8ZGoB9LRu
dyN3PLuRFzfW0xGJUpQV5N6rq5k/uSDRoYlIEhiV20BHWrImgG4tHRFe3FjPrUvXkZORxtKbziBNLQERiZPGAhoHstIDfOL4iXznouPYUNvEfS9vTXRIIpJClADGgI/OKePcWaXc8exG9jS0JTocEUkRSgBjgJnxnYuOoyvquHXpukSHIyIpQoPBjRGVhSFuPGc6tz2zkUV3v0ZFQSblBZlcclIFlYWhRIcnIklICWAM+buzplLX1MGaXQ28+E49dU0d3P/KVn7+uQWcNq0o0eGJSJLRXUBj2NZ9LXz+/jfZtr+VWy+ey6JTJic6JBEZB3QXUBKoKs7iiRtO5/TpxXzj8TVcf38Nr2/ez1hO2iIyfugS0BiXm5HGL685mZ8/v4lfvryFZ9fXMrc8l0tOquD06cXMKM3GzBIdpoiMQ7oENI60dXbxxMpd3PfKFjbWNgNQnJ3OjNJsMoN+MoN+Tj6mgKtOq8LnU1IQSVV6EjjJ7TjQyqvv7ueVd/ex+2A7reEIjW0Rth9o5dxZpdx+2YnkhdISHaaIJIASQApyzvHAq9v43lPrmJCXwfcuPp7jJuVSlBXUZSKRFDJaM4LJGGJmXP3hKo6vyOPLD67g6l++AUB2eoAPTyvip5fPJzPoT3CUIjJWqAWQpBrbw9RsPcC2/a28U9fMkje2c97sMn5x5QL86h8QSWpqAaS43Iw0zpn1/hzEM0uz+c4f1vHdP6zlOxcdp0tCIqIEkCquOX0Kuw62cc9ftlCam8EXzpqqSWhEUpwSQAr5xidms7uhnR//aQMPvLqVS06q4FMnllNRkEko6FerQCTFDMecwNcB3wImAS8C1znndvVR7gvAvwBFwDNeuQFnSFcfwPDrijqeXV/Lb9/cwfMb6oh6P/6MNB+T8jM5f+5ELllQwZTirMQGKiJHbVRuAzWzauAN4CXgUeDHwJ+ccxf1Kjef2NSRzwJ/Bv4f8JBz7qqBjq8EMLL2NrTz8qZ91Dd3sL+5g7/ubeLlTfuIOji5qoDPnzGVj80p00NlIuPMaHUCLwQM+A/n3INmtojYJPBFvf66v8ZbftM596aZXQgsMrPFqTQx/FgzIS+DSxZUfGDb3oZ2nli5iyVvbOeLv17OtJIsvnT2dD49v1x3D4kkmXh7Aeu85RlmNguYQSwhVPUqN8Vbdl8a2kks+VT2PqCZLTazGjOrqa+vjzM8GaoJeRl86exp/PdXF3LnovkEA37+6ZHVXHnv69Q2KleLJJN4E8BvgZeBLwLrgaC3/Ui/Kbr/lDzs+pNz7m7nXLVzrrqkpCTO8ORoBfw+Lpo3iadvOoMfXXICq3Yc5Pyf/oVlG+qOvLOIjAtxXQJyznWY2VnA8UAE+AlwBrDZzDKALudcGNji7VIB7AbKvfI74zm/jDwz47KTKznpmHxueHAl1/zqTSblZTClJIspxVnMKM3h2Ak5HFuWQ0FW8MgHFJExI64EYGZ+4HZgJXAycJ63Xkbsl/5TwIXAA8BNwP81sz8DHwaW6Pr/+DG9NIffffl0Hnh1K+v3NLF5Xwu/X7WbxvbIoTL5oTQmF4aoLAhx7IQcTqzMZ15lPnmZGpROZCyKtxPYEesI/gLQAvwM+CYw8QOFnFtuZjcQu130TOCPwD/GeW4ZZRlpfhafNe3QunOO2sYONtQ2sXFvE1v3t7DjvTbW7m7g6bf30H2DWWVhJtNLsplWks1ZM0s4c0axnjkQGQM0FpCMiMb2MG/taGDVjvfYUNvMprpmNtc30xGJcuaMYr51wWxmTchNdJgiSUnDQcuY0xmJ8uvXtvHT596hqT3MxSeWc8Wpk1lwTIFaBCLDSAlAxqyDrZ3c+dwmfvPmdlo6u5haksWlCyq48PhJTC4KJTo8kXFPCUDGvJaOCE+v2cNva3bw5tb3AJhbnst5s8uYNSGH6aU5HFMUIk2D1okMiRKAjCs732vlj2v2snTNHlbvOHhoe25GgJvOncFVp1URDCgRiAyGEoCMWy0dETbX
t/BOXRNPrtrNixvrmVqcxTfOn825s0o1NpHIESgBSFJwzrFsQz23PrWOzfUtTC4M8dmTK7l0QQVluRmJDk9kTFICkKQS7ory9Jo9LHljO69tPgDEhrAuykqnMCv4gdeU4iyOnZDDjNJsstID+MzwGbrTSFKGpoSUpJLm9/GpE8v51InlbNnXwp/X7aW+qYP9LZ3sb+7kYGsnm/c1s6+pk7ZwV5/HyMkIUJKdTnFOOuX5mVQWhphcGOKsGcWUqjUhKUgJQMadKcVZH3giuSfnHLsb2tm4t4l365tpD3cRdRCJOhrbwtQ3d1Df1MEbWw7wu1W7iDooCKXxk8vns3CmBh+U1KIEIEnFzCjPz6Q8P5OPzCodsGxnJMqGvU3c8uhqrvnVG9x87kxuPGe6OpklZagPQFJeW2cX33pyDY+v2EV5fiazJ+YwoyyHCbkZpPl9pPmNSfmZnDKlUM8kyLigPgCRQcoM+vm3z8zjzBnFPLu+jndqm3hhYz3hrg/+cVQQSuNv5k7gI8eWUlWcRWVBiMygP0FRi8RPLQCRPoS7ojS0hYl0OTojUdbvbeSpt/bw3PpaWjrf72SeMzGXX19/KoWaC0HGELUAROKQ5vdRnJ1+aH1yUYiPHzeB9nAX6/Y0suNAK1v3tXLXsk38w8Mrue/aUzRnsow7SgAiQ5CR5uekyQWcNLkAgLLcdL7++Bp++tw7fOWjMxMcncjQxN2jZWY3m9lWM+swsy1mdmM/5Vyv15Pxnlsk0bqfSr7zuXd4XvMlyzgTVwIwsxnAHUAU+AqQBtxpZpX97PIYsMh73RbPuUXGAjPj1k/NZfbEXG58aCU/+q+/svtgW6LDEhmUeFsA3fvvAp4F9gIdQH9z/a4D/uCce9g591Kc5xYZEzKDfu65agGnTSviFy+8y5k/ep6/f3A56/c0Jjo0kQHFfReQmX0N+D5gxFoC1zrnHuijnCM2h7AB24EbnHNLBzq27gKS8WbHgVZ+/do2HnpjO03tES48YSL/+NGZTCvJTnRokkJGZTA4MysBVgJ1wL8C3wamA3Occzt7lf0B8BpQAvwbsURQ5pxr7VVuMbAYYPLkyQu2bdt21PGJJEpDa5i7//Iuv3p5K23hLk4+ppDzj5/Ax+dOoDAriGH4faY7h2REjFYCuAz4DfAvzrnvmdk/A7cClwF/AKLOuc4+9nsM+FtglnNuQ3/HVwtAxrt9zR08+Np2nlqzm421zYd9npHmIzcjjdzMNHIyAmSnB8jJCJDm9+E3w8zISveTl5lGXmYa2ekBMoN+0gN+stL9h8pnBgOkB3ykB3xkpvkJ6InllDZazwFs9pZXmtke4HPe+kagDVgLzDWz84ErgWVAAfAJoB7YEuf5Rca04ux0/uG8GfzDeTPYVNfEixv30R7pwjmIdDlaOiM0tIZpaAvT0hmhqT3CnoZ2wl1Ros4RjRIr0xZmKH+rZab5yckIMDE/k3OOLeXjc8s4tixHQ2LLB8SVAJxzNWb2VeBG4C5gN/Bl59zqXl+0bcBE4EeAH6gBvtpX60AkWU0vjc1zfDSiUUdTR4SWjgjt4S5aO7toC3fR3B6hqSNCa0eEzq4oHeEorZ1dNLWHaWqP8E5dEz95biN3PLuRqqIQi06ZzGeqK/XksgAaCkIk6dU1tfPc+jqeWLGLN7YeIBjwce6sUvJDQYJ+I+D3HRr0Luj3kZuZRn4odlmq+9JTQSiopDGOaCgIEQGgNCeDRadMZtEpk9mwt4kHX9/Gc+vr6IhECXfFXpEuR2dXdMDjnDmjmG+eP5vZE3NHKXIZaWoBiAgQm0wn3OVoag9zsC3MwdYwje1hGtvCbNvfyi9f3kJDW5hLT6rgy+dM55iirESHLP1QC0BEhsTMCAaMoux0inoMhNft6tOquGvZJu57eSuPLN/JGdOLueLUyZwzq5SMNA2LPR6pBSAiQ1Lb2M5v3tzBw29sZ3dDOwGfMXtiLidW5jN7Yi5TS7KYWpxFSU667jpKkFF5
DmCkKQGIjF1dUcdLm/bx2ub9rN5xkLd2NtDcETn0ecBn5IdincjTS7P55LxJnDurTJPojAJdAhKREeX3GQtnlrBwZgkQu1V1d0MbW/a1sLm+hbqmdg62xvoSarYd4E9ra8kK+jlvThmfmDuRs48t0aWjBFMCEJFh4fMZFQUhKgpCnDmj5AOfdUUdr2/Zz+9X7ea/1u7ld6t2k5nm55QphUzKz6Q0J52i7KB3O6qPgM/ovnoU8PkoCKVRkBWkLDdDt6MOI10CEpFRFemK8vqWA/zx7T0s33aQ+qZ29rd0DvpJ53NmlfLFhdM4uapAfQz9UB+AiIwb3XMwdz+TEO6K4iA2ZEY0ynstYd5r7eSve5v49WvbONDSyYmV+Xx4WhEzy3KYXprNjLJs0gO6pARKACKSpNo6u3hk+Q4een07m+qaiURjv8MCPmN6aTZzJuZSmBUkI81PZtDPhNwMppVmM7Uki9yMtARHPzrUCSwiSSkz6Oeq06q46rQqOiNRtu1vYWNtM+v2NLBudyOvvLufxvYw7eEuor3+vj2hIo/PVFdy0bxJ5GWmRjIYiFoAIpKUnHN0RKLsOtjGu3XNbKxt4qk1e1m/p5H0gI8FxxRQWRCisjDT64jOoDQ3neLsdHIzAuN6SG1dAhIR6cU5x9rdjTxSs4O3djWw40Ab+5o7+izbPT+D32cEfEZWeoAZpdnMnJDDjNIcJuZlUJabQVFWEN8Ym9hHl4BERHoxM+aW5zG3PO/QtrbOLvY2tlPX2E5tUwf7mztoaIvN0dDcHqHLOaJRx3utYV7fcoAnV+3udUxID/gI+n1kePMw5HqT9wT9Pvw+I80bZbUoK0hBVjB2W2soSF4ojfzM2MiruRlpBAOj2+pQAhCRlJYZ9DOlOIspxYMb3K6hLczm+mZqGzuobWxnf3MHHZEoHZEo7eEumtojNHrzMUSisbuaIlHHwdbYnUxdvTsmekgP+MhOD5CdEeCjs8v45wvnDFc1+6QEICIyBHmZacyfXHBU+0ajjsb2MO+1hjnY2nloxNWGtjANrWGaOyM0t0do7ogwIS9jmCM/XNwJwMxuBm4mNuPXbuB259y/91HuYuA2oILY5PDXOuc0JaSIpAyfz8gPBckPBYHED6cd1wUcJe4WAAAFCUlEQVQnM5sB3AFEga8AacCdZlbZq9wE4GGgEbgFWADcH8+5RUQkPvH2OHTvvwt4FtgLdADtvcotAtKB73utgyeAM81sWpznFxGRoxRXAnDObQC+DpwO/BWYDyx2ztX3KjrFW+7ylju95dTexzSzxWZWY2Y19fW9DyMiIsMl3ktAJcCNwCrgYmA18DMzqzjSrt7ysO5w59zdzrlq51x1SUlJ749FRGSYxHsJ6CNAOfC4c+53wONADnCamWWYWfe4rd2dvd2JobzXdhERGWXx3gW02VteaWZ7gM956xuBNmAtMJdYB/APgK+ZWRnwaeAl59y7cZ5fRESOUrx9ADXAV4l18N7lLb/snFvdq9weYh3B+cRuBV0JXBPPuUVEJD5xPwfgnLsduL2P7dZr/XFil4hERGQMGNODwZlZPbAtjkMUA/uGKZzxIhXrDKlZb9U5dQy13sc45454F82YTgDxMrOawYyIl0xSsc6QmvVWnVPHSNV7/A54LSIicVECEBFJUcmeAO5OdAAJkIp1htSst+qcOkak3kndByAiIv1L9haAiIj0IykTgJmdbmZvmVmHma0ws5MSHdNwM7MZZva8me03syYz+3P36KpmdrGZbTKzdjNbZmZTjnS88cQbZmSDmTkz+5m3bbaZveL9zDeY2ccSHedwMrN8M3vAzA6aWbOZvehtT9rvupndbGZbvbptMbMbve1JVWczu9PMar3v89Ie2/v9Tg/Xv0HSJQAzywAeIzYm0T8CZcCjZuZPaGDDr5zYz+/bwK+A84B7U2Tuhf/D++NKdVsCzCI2L0UYeMTM8nrvOI79kthQK/+f2ARMm5L5u36EuUaSsc4P97Gt
z+/0sP7cnXNJ9SI2zpADbvHWv+utn5vo2Ia5nsFe6/uBOu8L4YDPeNsf8NanJTrmYar3CcTGmbrFq9fPiA1D7oC7vDLXeeufT3S8w1TnqV59fg0EAb+3PWm/68CxXl3+4r2vITbPyGeSsc5AlVePpd56v9/p4fy5J10LgCHMPTCeOec6u9+bWTVQCLxIEtffzHzAvcTGnXqzx0dJW2dP98zgJwMtQIuZ/ZAkrrfrZ64RoHu2waSrcy8D/WyH7eeejAmgt37nHkgGZnYs8DtgK7G5GQ4r4i2Tof7XEvtL6QHeH1I8j9jlgZ6Sqc4QG2QRYpPIfhZ4GfjfHD6WV9LUu7+5RoDs3kW95biv8xEMVM+j/jeIezC4MShl5h4wsznAfxObhvMc59weM0vm+lcCJcR+GXS7EpjkvU/GOkMsuQP8xTn3uPfL8Rze/4+fjPXunmvkF86535nZ8cCtwHrv82Ssc08D/T8+MMBnQ5Poa18jcC0tA6j1/jG+RKyZtAXvummyvIj9MqwDIsSaypd7r4nEEsJyYn9BNRH7xZHwmIehznOAS73Xt4n9xfNHYh3dq73/GDcAbxPrBM9PdMzDVG8D3vJ+3n8HvOb93Ocm63cdqPZ+vn8ldt17vbc+L9nqDFwAfM2r32rgemBGf9/p4fwdl/DKj9A/6FnAGqCT2NwD1YmOaQTqeLb3hfnAy/vsb4F3vUTwIknSAdxP/X/mrR8HvOrVeSPwN4mOcZjr212/dq9+V3jbk/a7Tuzuly1enTcDNyRjnYFlffxfvmag7/Rw/RvoSWARkRSVCp3AIiLSByUAEZEUpQQgIpKilABERFKUEoCISIpSAhARSVFKACIiKUoJQEQkRf0PVq4XtmtNUDYAAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Semi-log plot: log of frequency against rank\n",
    "plt.plot(x, np.log(frequiences))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Computing probabilities"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:36:47.586611Z",
     "start_time": "2020-11-29T08:36:47.566622Z"
    }
   },
   "outputs": [],
   "source": [
    "# Unigram probability of a single token\n",
    "def prob_1(word):\n",
    "    return words_count[word] / len(TOKEN)\n",
    "\n",
    "# p(w_k) = count(w_k) / (total number of tokens)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:36:57.299097Z",
     "start_time": "2020-11-29T08:36:57.288107Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.001554473157589251"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "prob_1('我们')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Conditional probability: p(w2|w1) = count(w1, w2) / count(w1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:40:18.304826Z",
     "start_time": "2020-11-29T08:40:17.041603Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['此外', '自', '本周', '6', '月', '12', '日起', '除', '小米', '手机']"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Convert every element of the list to a string\n",
    "TOKEN = [str(t) for t in TOKEN]\n",
    "TOKEN[:10]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:40:43.742528Z",
     "start_time": "2020-11-29T08:40:40.762334Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['此外自', '自本周', '本周6', '6月', '月12', '12日起', '日起除', '除小米', '小米手机', '手机6']"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
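   ],
   "source": [
    "# Join each pair of adjacent tokens into a bigram\n",
    "# Note: len(TOKEN[:-2]) stops one pair short, so the final bigram is not generated\n",
    "TOKEN_2_GRAM = [''.join(TOKEN[i:i+2]) for i in range(len(TOKEN[:-2]))]\n",
    "TOKEN_2_GRAM[:10]"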
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:41:01.447453Z",
     "start_time": "2020-11-29T08:40:59.675543Z"
    }
   },
   "outputs": [],
   "source": [
    "# Count how often each bigram occurs\n",
    "words_count_2 = Counter(TOKEN_2_GRAM)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:44:50.318765Z",
     "start_time": "2020-11-29T08:44:50.306771Z"
    }
   },
   "outputs": [],
   "source": [
    "# Conditional (bigram) probability\n",
    "def prob_2(word1, word2):  # p(w2|w1) = count(w1, w2) / count(w1)\n",
    "    if word1 + word2 in words_count_2:\n",
    "        return words_count_2[word1 + word2] / words_count[word1]\n",
    "    # Unseen bigram: fall back to a small non-zero probability\n",
    "    return 1 / len(TOKEN_2_GRAM)\n",
    "\n",
    "# Example: if the corpus bigrams are (w1, w2), (w3, w4), (w4, w5),\n",
    "# the unseen pair (w1, w3) is assigned 1/3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "######  Whether the denominator here is count(w1) or count(w2) has a large effect on the resulting probability"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:45:45.719127Z",
     "start_time": "2020-11-29T08:45:45.704139Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.030128874956461164"
      ]
     },
     "execution_count": 82,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "prob_2('我们', '在')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:45:46.583385Z",
     "start_time": "2020-11-29T08:45:46.568395Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2.1110407430863417e-05"
      ]
     },
     "execution_count": 83,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "prob_2('在', '吃饭')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:45:47.060969Z",
     "start_time": "2020-11-29T08:45:47.039988Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2.707199580708929e-07"
      ]
     },
     "execution_count": 84,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "prob_2('去', '吃饭')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "语言模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:47:06.723491Z",
     "start_time": "2020-11-29T08:47:06.703508Z"
    }
   },
   "outputs": [],
   "source": [
    "# 基于语言模型，计算一条语句出现的概率\n",
    "def get_probablity(sentence):\n",
    "    words = cut(sentence)\n",
    "    \n",
    "    sentence_pro = 1\n",
    "    \n",
    "    for i, word in enumerate(words[:-1]):\n",
    "        next_ = words[i+1]\n",
    "        \n",
    "        probability = prob_2(word, next_)  # p(w1|w2)\n",
    "        \n",
    "        sentence_pro *= probability  # p(s) = p(w_1)p(w2|w1)*p(w3|w2)..p(wn|wn-1) \n",
    "    \n",
    "    return sentence_pro"
   ]
  },
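  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Multiplying many small probabilities quickly produces tiny values, and for long sentences the product can underflow to 0.0. A common remedy is to add log probabilities instead of multiplying. The sketch below assumes a bigram scorer with the same call signature as `prob_2`; the stand-in `toy_prob_2`, its table, and `log_probability` are hypothetical names used only for illustration.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "# Hypothetical stand-in for prob_2: a tiny table plus a small fallback value\n",
    "toy_table = {('今晚', '我'): 0.1, ('我', '去'): 0.2, ('去', '吃'): 0.3}\n",
    "\n",
    "def toy_prob_2(word1, word2):\n",
    "    return toy_table.get((word1, word2), 1e-6)\n",
    "\n",
    "def log_probability(words, prob_fn=toy_prob_2):\n",
    "    # Sum of log p(next | current) over all adjacent pairs\n",
    "    return sum(math.log(prob_fn(w, nxt)) for w, nxt in zip(words, words[1:]))\n",
    "\n",
    "words = ['今晚', '我', '去', '吃']\n",
    "print(log_probability(words))            # log of 0.1 * 0.2 * 0.3\n",
    "print(math.exp(log_probability(words)))  # back to a plain probability\n",
    "```\n",
    "\n",
    "Because the logarithm is monotonic, comparing two sentences by log probability gives the same ranking as comparing them by probability."
   ]
  },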
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:47:17.626504Z",
     "start_time": "2020-11-29T08:47:17.607520Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "6.743762360853308e-35"
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_probablity('小明今天抽奖抽到一台苹果手机')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 87,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:47:24.517962Z",
     "start_time": "2020-11-29T08:47:24.504974Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7.989690983840629e-36"
      ]
     },
     "execution_count": 87,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_probablity('小明今天抽奖抽到一架波音飞机')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 88,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:47:30.204900Z",
     "start_time": "2020-11-29T08:47:30.191912Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.9840875058382383e-20"
      ]
     },
     "execution_count": 88,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_probablity('洋葱奶昔来一杯')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 89,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:47:40.529884Z",
     "start_time": "2020-11-29T08:47:40.511895Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7.3289295697906e-14"
      ]
     },
     "execution_count": 89,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_probablity('养乐多绿来一杯')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 90,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:48:27.866770Z",
     "start_time": "2020-11-29T08:48:27.839785Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sentence: 这个好看的女人看着一个女人 with Prb: 3.62337402355287e-28\n",
      "sentence: 这个桌子看见一个小小的篮球 with Prb: 2.9056262380901015e-25\n",
      "sentence: 这个蓝色的小小的篮球看见这个好看的小猫 with Prb: 2.233653670683022e-41\n",
      "sentence: 一个小小的蓝色的桌子听着这个蓝色的女人 with Prb: 3.7849014935357674e-38\n",
      "sentence: 这个小猫看着一个桌子 with Prb: 2.9298397900741853e-24\n",
      "sentence: 一个篮球听着一个好看的篮球 with Prb: 5.212286615014935e-29\n",
      "sentence: 一个小小的女人听着一个蓝色的女人 with Prb: 6.58742342357102e-32\n",
      "sentence: 一个小猫听着一个好看的桌子 with Prb: 2.3165718288955266e-29\n",
      "sentence: 一个女人看着一个好看的桌子 with Prb: 4.9977572738660276e-29\n",
      "sentence: 这个小小的桌子坐在这个小小的好看的蓝色的桌子 with Prb: 8.715542627988036e-51\n"
     ]
    }
   ],
   "source": [
    "# 根据语法描述生成10个句子，计算出现的概率\n",
    "for sen in [generate(gram=example_grammar, target='sentence') for i in range(10)]:\n",
    "    print('sentence: {} with Prb: {}'.format(sen, get_probablity(sen)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 91,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T08:54:32.391220Z",
     "start_time": "2020-11-29T08:54:32.362235Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "明天晚上请你吃大餐，我们一起吃苹果 is more possible\n",
      "---- 今天晚上请你吃大餐，我们一起吃日料 with probility 6.684624207742742e-46\n",
      "---- 明天晚上请你吃大餐，我们一起吃苹果 with probility 7.542849854956504e-46\n",
      "真是一只好看的小猫 is more possible\n",
      "---- 真事一只好看的小猫 with probility 2.1153007661637964e-26\n",
      "---- 真是一只好看的小猫 with probility 7.813612196297205e-20\n",
      "今晚我去吃火锅 is more possible\n",
      "---- 今晚我去吃火锅 with probility 5.012457937326253e-16\n",
      "---- 今晚火锅去吃我 with probility 1.563034443630964e-18\n",
      "养乐多绿来一杯 is more possible\n",
      "---- 洋葱奶昔来一杯 with probility 1.9840875058382383e-20\n",
      "---- 养乐多绿来一杯 with probility 7.3289295697906e-14\n"
     ]
    }
   ],
   "source": [
    "# 比较两个句子出现的概率大小\n",
    "need_compared = [\n",
    "    \"今天晚上请你吃大餐，我们一起吃日料 明天晚上请你吃大餐，我们一起吃苹果\",\n",
    "    \"真事一只好看的小猫 真是一只好看的小猫\",\n",
    "    \"今晚我去吃火锅 今晚火锅去吃我\",\n",
    "    \"洋葱奶昔来一杯 养乐多绿来一杯\"\n",
    "]\n",
    "\n",
    "for s in need_compared:\n",
    "    s1, s2 = s.split()\n",
    "    p1, p2 = get_probablity(s1), get_probablity(s2)\n",
    "    \n",
    "    better = s1 if p1 > p2 else s2\n",
    "    \n",
    "    print('{} is more possible'.format(better))\n",
    "    print('-'*4 + ' {} with probility {}'.format(s1, p1))\n",
    "    print('-'*4 + ' {} with probility {}'.format(s2, p2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. 完成以下问答和编程练习"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  基础理论部分"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 0. Can you come up out 3 sceneraies which use AI methods? "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans: {人脸识别、语音助手、智能对话}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1. How do we use Github; Why do we use Jupyter and Pycharm;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans: {\n",
    "1. Github：远程代码仓库，保存不同代码版本，便于分享和协作；\n",
    "2. Jupyter：基于服务器-客户端结构的网页应用，局部代码即时运行，交互性效果最好，支持MarkDown注释和绘图展示；\n",
    "3. Pycharm：Python专用集成开发环境，安装配置简单，功能支持全面。}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. What's the Probability Model?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans:概率模型是描述不同随机变量之间关系的数学模型，通常情况下刻画了一个或多个随机变量之间的相互非确定性的概率关系。\n",
    "从数学上讲，该模型通常被表达为一个概率分布函数或密度函数的集合(Y,P)，其中 Y 是观测集合用来描述可能的观测结果， P 是 Y 对应的概率分布函数集合。\n",
    "\n",
    "若使用概率模型，一般而言需假设存在一个确定的分布P 生成观测数据 Y 。因此通常使用统计推断的办法确定集合 P 中谁是数据产生的原因。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. Can you came up with some sceneraies at which we could use Probability Model?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans:{抛N次硬币出现正面朝上的次数、人类身高的统计分布、财富在人群中的统计分布}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4. Why do we use probability and what's the difficult points for programming based on parsing and pattern match?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans:\n",
    "1. 概率的随机性允许模糊和不确定，概率计算符合大规模的统计结果；互联网大数据的积累为概率统计方法提供了可能。\n",
    "2. 句法分析和模式匹配的困难：自然语言中包含大量歧义，具有模糊性和不确定性；语法规则的开发复杂，并且并不完备。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 5. What's the Language Model;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans:概率语言模型，描述基于大规模语料库，如何计算一条语句出现的概率。\n",
    "典型的N-Gram模型，基于N-1阶马尔可夫链,认为当前词仅与前N-1个词有关,这就解决了维数灾难这个问题。\n",
    "基于条件概率和马尔科夫独立性假设，一条语句出现的概率等于其所有相邻N个词出现的条件概率的连乘积。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 6. Can you came up with some sceneraies at which we could use Language Model?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans:智能对话、语音识别、机器翻译"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 7. What's the 1-gram language model;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans:一元语言模型中，一条语句出现的概率定义为其中所有词出现概率的连乘积。\n",
    "基于条件无关假设，即认为每个词都是条件无关的"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 8. What's the disadvantages and advantages of 1-gram language model;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans:缺点：不能判断词之间的上下文关系。\n",
    "优点：计算简单。"
   ]
  },
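  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The disadvantage can be seen in a few lines: a unigram model multiplies word probabilities independently, so any reordering of the same words receives the same score. The counts in `toy_counts` and the function `unigram_probability` are invented for illustration.\n",
    "\n",
    "```python\n",
    "# Hypothetical unigram counts\n",
    "toy_counts = {'今晚': 2, '我': 5, '去': 3, '吃': 4, '火锅': 1}\n",
    "total = sum(toy_counts.values())\n",
    "\n",
    "def unigram_probability(words):\n",
    "    # p(s) = p(w1) * p(w2) * ... * p(wn), every word treated independently\n",
    "    p = 1.0\n",
    "    for w in words:\n",
    "        p *= toy_counts[w] / total\n",
    "    return p\n",
    "\n",
    "p_natural = unigram_probability(['今晚', '我', '去', '吃', '火锅'])\n",
    "p_shuffled = unigram_probability(['今晚', '火锅', '去', '吃', '我'])\n",
    "print(p_natural, p_shuffled)  # equal up to floating-point rounding\n",
    "```\n",
    "\n",
    "A bigram model, by contrast, scores each word given its predecessor, so the two orderings can receive different probabilities."
   ]
  },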
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 9. What't the 2-gram models;"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans:二元语言模型中，一条语句出现的概率定义为其中所有相邻两个词出现概率的连乘积。\n",
    "基于条件概率和马尔科夫独立性假设。"
   ]
  },
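  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a worked form of the 2-gram answer (a standard textbook formulation, with $w_1, \\dots, w_n$ denoting the words of a sentence and $\\mathrm{count}(\\cdot)$ a corpus frequency, both notations introduced here for illustration), the chain rule followed by the first-order Markov assumption gives:\n",
    "\n",
    "$$P(w_1, \\dots, w_n) = P(w_1) \\prod_{i=2}^{n} P(w_i \\mid w_1, \\dots, w_{i-1}) \\approx P(w_1) \\prod_{i=2}^{n} P(w_i \\mid w_{i-1})$$\n",
    "\n",
    "Each factor can then be estimated from counts:\n",
    "\n",
    "$$P(w_i \\mid w_{i-1}) \\approx \\frac{\\mathrm{count}(w_{i-1} w_i)}{\\mathrm{count}(w_{i-1})}$$"
   ]
  },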
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 编程实践部分"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1. 设计你自己的句子生成器"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如何生成句子是一个很经典的问题，从1940s开始，图灵提出机器智能的时候，就使用的是人类能不能流畅和计算机进行对话。和计算机对话的一个前提是，计算机能够生成语言。\n",
    "\n",
    "计算机如何能生成语言是一个经典但是又很复杂的问题。 我们课程上为大家介绍的是一种基于规则（Rule Based）的生成方法。该方法虽然提出的时间早，但是现在依然在很多地方能够大显身手。值得说明的是，现在很多很实用的算法，都是很久之前提出的，例如，二分查找提出与1940s, Dijstra算法提出于1960s 等等。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在著名的电视剧，电影《西部世界》中，这些机器人们语言生成的方法就是使用的SyntaxTree生成语言的方法。\n",
    "\n",
    "> \n",
    ">\n",
    "\n",
    "![WstWorld](https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1569578233461&di=4adfa7597fb380e7cc0e67190bbd7605&imgtype=0&src=http%3A%2F%2Fs1.sinaimg.cn%2Flarge%2F006eYYfyzy76cmpG3Yb1f)\n",
    "\n",
    "> \n",
    ">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这一部分，需要各位同学首先定义自己的语言。 大家可以先想一个应用场景，然后在这个场景下，定义语法。例如：\n",
    "\n",
    "在西部世界里，一个”人类“的语言可以定义为：\n",
    "``` \n",
    "human = \"\"\"\n",
    "human = 自己 寻找 活动\n",
    "自己 = 我 | 俺 | 我们 \n",
    "寻找 = 看看 | 找找 | 想找点\n",
    "活动 = 乐子 | 玩的\n",
    "\"\"\"\n",
    "```\n",
    "\n",
    "一个“接待员”的语言可以定义为\n",
    "```\n",
    "host = \"\"\"\n",
    "host = 寒暄 报数 询问 业务相关 结尾 \n",
    "报数 = 我是 数字 号 ,\n",
    "数字 = 单个数字 | 数字 单个数字 \n",
    "单个数字 = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 \n",
    "寒暄 = 称谓 打招呼 | 打招呼\n",
    "称谓 = 人称 ,\n",
    "人称 = 先生 | 女士 | 小朋友\n",
    "打招呼 = 你好 | 您好 \n",
    "询问 = 请问你要 | 您需要\n",
    "业务相关 = 玩玩 具体业务\n",
    "玩玩 = 耍一耍 | 玩一玩\n",
    "具体业务 = 喝酒 | 打牌 | 打猎 | 赌博\n",
    "结尾 = 吗？\"\"\"\n",
    "\n",
    "```\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "请定义你自己的语法: "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "第一个语法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 94,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T11:58:09.832804Z",
     "start_time": "2020-11-29T11:58:09.822811Z"
    }
   },
   "outputs": [],
   "source": [
    "you_need_replace_this_with_name_you_given = '''\n",
    "# you code here\n",
    "'''\n",
    "# 参考\n",
    "poem = '''\n",
    "sentence => sentence1 sentence1 sentence2 sentence2\n",
    "sentence1 => adj_phrase noun_phrase verb_phrase noun_phrase punctuation\n",
    "sentence2 => noun verb_phrase noun_phrase adj_phrase noun punctuation\n",
    "adj_phrase => num unit\n",
    "noun_phrase => adj noun            \n",
    "verb_phrase => verb\n",
    "num => 一 | 二 | 三 | 两 | 千 | 万 \n",
    "unit => 行 | 只 | 个 | 声 | 里 | 秋 | 冬\n",
    "adj =>  白 | 黄 | 翠 | 青 | 西 | 东 | 北 | 南 \n",
    "noun =>   鹭 |  鹂 | 柳 | 天 | 岭 | 窗 | 雪  | 门 | 吴 | 船\n",
    "verb => 鸣 | 上 | 含 |  泊\n",
    "punctuation => ，| 。| ? | ！\n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
 **评阅点**： 是否提出了">
    "> **Review point**: does the grammar differ substantially from the one shown in class?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "第二个语法："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T11:58:11.081886Z",
     "start_time": "2020-11-29T11:58:11.067896Z"
    }
   },
   "outputs": [],
   "source": [
    "you_need_replace_this_with_name_you_given = '''\n",
    "# you code here\n",
    "'''\n",
    "# 参考\n",
    "dynast = '''\n",
    "sentence => dy1 dy2 dy3\n",
    "dy1 => verb adverb punctuation\n",
    "dy2 => adj_phrase noun_phrase punctuation\n",
    "dy3 => noun_phrase adverb noun_phrase adj punctuation\n",
    "adj_phrase => num unit\n",
    "noun_phrase => adj noun\n",
    "verb => 念 | 道 | 悲 | 忆\n",
    "adverb => 去去 | 沉沉 | 呜呼 | 呼哉 | 凄凄\n",
    "num => 千 | 万 | 双\n",
    "unit => 行 | 古 | 里\n",
    "adj =>  烟 | 暮 | 楚 | 阔\n",
    "noun =>   波 | 霭 | 天 | 雪 | 船\n",
    "punctuation => ，| 。| ? | ！\n",
    "'''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
 **评阅点**：是否和上一个">
    "> **Review point**: does it differ substantially from the first grammar?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TODO: 然后，使用自己之前定义的generate函数，使用此函数生成句子。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T11:58:13.237069Z",
     "start_time": "2020-11-29T11:58:13.211085Z"
    }
   },
   "outputs": [],
   "source": [
    "# （1.2）根据语法描述 grammar_str 生成规则 grammar\n",
    "def create_grammar(grammar_str, split='=>', line_split='\\n'):\n",
    "    grammar = {}\n",
    "    for line in grammar_str.split(line_split):\n",
    "        if not line.strip(): continue\n",
    "        exp, stmt = line.split(split)\n",
    "        grammar[exp.strip()] = [s.split() for s in stmt.split('|')]\n",
    "    return grammar\n",
    "\n",
    "# （3.1）根据（句子）语法规则生成句子\n",
    "choice = random.choice\n",
    "\n",
    "def generate(gram, target):\n",
    "    if target not in gram: return target # means target is a terminal expression #1\n",
    "    \n",
    "    expaned = [generate(gram, t) for t in choice(gram[target])]  #2\n",
    "    return ''.join([e if e != '/n' else '\\n' for e in expaned if e != 'null']) #3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 97,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T11:58:14.006109Z",
     "start_time": "2020-11-29T11:58:13.984124Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'两个黄鹂含白天?三行南岭鸣北吴?吴上南门万冬岭！鹭泊白鹭三个鹭，'"
      ]
     },
     "execution_count": 97,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generate(gram = create_grammar(poem, split='=>'), target='sentence')"
   ]
  },
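  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `human` and `host` example grammars in the task description are written with `=` rather than `=>`. A minimal self-contained sketch of parsing and expanding such a grammar follows; `parse_grammar` and `expand` are illustrative re-implementations written for this example, not the course's reference code.\n",
    "\n",
    "```python\n",
    "import random\n",
    "\n",
    "human = '''\n",
    "human = 自己 寻找 活动\n",
    "自己 = 我 | 俺 | 我们\n",
    "寻找 = 看看 | 找找 | 想找点\n",
    "活动 = 乐子 | 玩的\n",
    "'''\n",
    "\n",
    "def parse_grammar(grammar_str, split='='):\n",
    "    # Map each left-hand symbol to a list of alternatives (each a list of symbols)\n",
    "    grammar = {}\n",
    "    for line in grammar_str.splitlines():\n",
    "        if not line.strip():\n",
    "            continue\n",
    "        # maxsplit=1 keeps any later separator characters inside the right-hand side\n",
    "        lhs, rhs = line.split(split, 1)\n",
    "        grammar[lhs.strip()] = [alt.split() for alt in rhs.split('|')]\n",
    "    return grammar\n",
    "\n",
    "def expand(grammar, target):\n",
    "    # Symbols without a rule are terminals and are emitted unchanged\n",
    "    if target not in grammar:\n",
    "        return target\n",
    "    return ''.join(expand(grammar, t) for t in random.choice(grammar[target]))\n",
    "\n",
    "print(expand(parse_grammar(human), 'human'))\n",
    "```\n",
    "\n",
    "With the notebook's own functions, the equivalent call would pass `split='='` to `create_grammar`."
   ]
  },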
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TODO: 然后，定义一个函数，generate_n，将generate扩展，使其能够生成n个句子:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T11:59:34.106297Z",
     "start_time": "2020-11-29T11:59:34.088308Z"
    }
   },
   "outputs": [],
   "source": [
    "def generate_n(num):\n",
    "    # you code here\n",
    "    for i in range(num):\n",
    "        print(generate(gram = create_grammar(poem, split='=>'), target='sentence'))\n",
    "        print(generate(gram = create_grammar(dynast, split='=>'), target='sentence'))\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T11:59:45.577129Z",
     "start_time": "2020-11-29T11:59:45.563135Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "千冬翠吴上南天。三里翠船含南窗?柳含白柳万冬门。窗上南鹭万冬船。\n",
      "悲凄凄?万行暮波。阔天去去阔波烟！\n",
      "千个白鹭含翠门?千只南吴泊南天，门含翠吴三秋雪，鹂泊南柳二冬鹂。\n",
      "悲去去，万古暮天。阔霭去去暮霭楚?\n",
      "二秋翠船含白天！万行西门鸣南吴。雪泊翠柳千行鹂！岭泊南船千里柳，\n",
      "念呜呼，万行烟船。烟天呼哉烟雪阔。\n",
      "三秋北雪含西柳。二只青柳上白窗?窗含北柳二秋天，鹭含青雪三个门，\n",
      "道沉沉，千里暮雪！阔船去去楚霭烟。\n",
      "两个青窗含南雪?两里西鹭泊翠鹭，船含青吴两只岭！雪鸣黄雪两秋窗，\n",
      "道呼哉！千古暮霭?楚天沉沉暮船暮?\n"
     ]
    }
   ],
   "source": [
    "generate_n(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
 **评阅点**; 运行代码">
    "> **Review point**: run the code and check that it can generate multiple sentences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2. 使用新数据源完成语言模型的训练"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "按照我们上文中定义的`prob_2`函数，我们更换一个文本数据源，获得新的Language Model:\n",
    "\n",
    "1. 下载文本数据集（你可以在以下数据集中任选一个，也可以两个都使用）\n",
    "    + 可选数据集1，保险行业问询对话集： https://github.com/Computing-Intelligence/insuranceqa-corpus-zh/raw/release/corpus/pool/train.txt.gz\n",
    "    + 可选数据集2：豆瓣评论数据集：https://github.com/Computing-Intelligence/datasource/raw/master/movie_comments.csv\n",
    "2. 修改代码，获得新的**2-gram**语言模型\n",
    "    + 进行文本清洗，获得所有的纯文本\n",
    "    + 将这些文本进行切词\n",
    "    + 送入之前定义的语言模型中，判断文本的合理程度"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 101,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T12:09:49.156637Z",
     "start_time": "2020-11-29T12:09:47.621226Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "H:\\Anaconda3\\anzhuang1\\lib\\site-packages\\IPython\\core\\interactiveshell.py:3049: DtypeWarning: Columns (0,4) have mixed types. Specify dtype option on import or set low_memory=False.\n",
      "  interactivity=interactivity, compiler=compiler, result=result)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>link</th>\n",
       "      <th>name</th>\n",
       "      <th>comment</th>\n",
       "      <th>star</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>https://movie.douban.com/subject/26363254/</td>\n",
       "      <td>战狼2</td>\n",
       "      <td>吴京意淫到了脑残的地步，看了恶心想吐</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>https://movie.douban.com/subject/26363254/</td>\n",
       "      <td>战狼2</td>\n",
       "      <td>首映礼看的。太恐怖了这个电影，不讲道理的，完全就是吴京在实现他这个小粉红的英雄梦。各种装备轮...</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>https://movie.douban.com/subject/26363254/</td>\n",
       "      <td>战狼2</td>\n",
       "      <td>吴京的炒作水平不输冯小刚，但小刚至少不会用主旋律来炒作…吴京让人看了不舒服，为了主旋律而主旋...</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>https://movie.douban.com/subject/26363254/</td>\n",
       "      <td>战狼2</td>\n",
       "      <td>凭良心说，好看到不像《战狼1》的续集，完虐《湄公河行动》。</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>https://movie.douban.com/subject/26363254/</td>\n",
       "      <td>战狼2</td>\n",
       "      <td>中二得很</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  id                                        link name  \\\n",
       "0  1  https://movie.douban.com/subject/26363254/  战狼2   \n",
       "1  2  https://movie.douban.com/subject/26363254/  战狼2   \n",
       "2  3  https://movie.douban.com/subject/26363254/  战狼2   \n",
       "3  4  https://movie.douban.com/subject/26363254/  战狼2   \n",
       "4  5  https://movie.douban.com/subject/26363254/  战狼2   \n",
       "\n",
       "                                             comment star  \n",
       "0                                 吴京意淫到了脑残的地步，看了恶心想吐    1  \n",
       "1  首映礼看的。太恐怖了这个电影，不讲道理的，完全就是吴京在实现他这个小粉红的英雄梦。各种装备轮...    2  \n",
       "2  吴京的炒作水平不输冯小刚，但小刚至少不会用主旋律来炒作…吴京让人看了不舒服，为了主旋律而主旋...    2  \n",
       "3                      凭良心说，好看到不像《战狼1》的续集，完虐《湄公河行动》。    4  \n",
       "4                                               中二得很    1  "
      ]
     },
     "execution_count": 101,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#读取文件\n",
    "filename = 'movie_comments.csv'\n",
    "content = pd.read_csv(filename, encoding='utf-8')\n",
    "content.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T12:11:48.907820Z",
     "start_time": "2020-11-29T12:11:46.304349Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "261497\n",
      "261497\n"
     ]
    }
   ],
   "source": [
    "# 提取词并写入文件\n",
    "articles = content['comment'].tolist()\n",
    "print(len(articles))\n",
    "\n",
    "def token(string):\n",
    "    # we will learn the regular expression next course.\n",
    "    return re.findall('\\w+', string)\n",
    "\n",
    "articles_clean = [''.join(token(str(a)))for a in articles]\n",
    "print(len(articles_clean))\n",
    "\n",
    "with open('article_movie_comments.txt', 'w', encoding='utf-8') as f:\n",
    "    for a in articles_clean:\n",
    "        f.write(a + '\\n')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T12:13:01.466677Z",
     "start_time": "2020-11-29T12:12:10.952724Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\n",
      "10000\n",
      "20000\n",
      "30000\n",
      "40000\n",
      "50000\n",
      "60000\n",
      "70000\n",
      "80000\n",
      "90000\n",
      "100000\n"
     ]
    }
   ],
   "source": [
    "# 分词\n",
    "def cut(string): return list(jieba.cut(string))\n",
    "\n",
    "TOKEN = []\n",
    "for i, line in enumerate((open('article_movie_comments.txt','r',encoding='utf-8'))):\n",
    "    if i % 10000 == 0: print(i)\n",
    "    # replace 10000 with a big number when you do your homework. \n",
    "    if i > 100000: break    \n",
    "    TOKEN += cut(line)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T12:13:43.012973Z",
     "start_time": "2020-11-29T12:13:39.508872Z"
    }
   },
   "outputs": [],
   "source": [
    "# 计算概率\n",
    "words_count = Counter(TOKEN)\n",
    "\n",
    "TOKEN = [str(t) for t in TOKEN]\n",
    "TOKEN_2_GRAM = [''.join(TOKEN[i:i+2]) for i in range(len(TOKEN[:-2]))]\n",
    "words_count_2 = Counter(TOKEN_2_GRAM)\n",
    "\n",
    "def prob_1(word):\n",
    "    return words_count[word] / len(TOKEN)\n",
    "# count(wk)/(number of words)\n",
    "\n",
    "def prob_2(word1, word2):  # p(w1,w2) = count(w1,2)/count(w1)\n",
    "    if word1 + word2 in words_count_2: return words_count_2[word1+word2] / words_count[word1]\n",
    "    else:\n",
    "        return 1 / len(TOKEN_2_GRAM)\n",
    "    \n",
    "#  (w1 w2), (w3,w4) (w4,w5)  2-gram\n",
    "# (w1,w3)  1/3\n",
    "\n",
    "def get_probablity(sentence):\n",
    "    words = cut(sentence)\n",
    "    \n",
    "    sentence_pro = 1\n",
    "    \n",
    "    for i, word in enumerate(words[:-1]):\n",
    "        next_ = words[i+1]\n",
    "        \n",
    "        probability = prob_2(word, next_)  # p(w1|w2)\n",
    "        \n",
    "        sentence_pro *= probability  # p(s) = p(w_1)p(w2|w1)*p(w3|w2)..p(wn|wn-1) \n",
    "    \n",
    "    return sentence_pro"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T12:14:08.282041Z",
     "start_time": "2020-11-29T12:14:08.260059Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "9.70386170001317e-34"
      ]
     },
     "execution_count": 105,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 测试句子\n",
    "get_probablity('小明今天抽奖抽到一台苹果手机')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T12:14:08.994974Z",
     "start_time": "2020-11-29T12:14:08.981982Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.8457615165129006e-38"
      ]
     },
     "execution_count": 106,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_probablity('小明今天抽奖抽到一架波音飞机')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T12:14:09.830271Z",
     "start_time": "2020-11-29T12:14:09.817285Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1.3585880599037004e-19"
      ]
     },
     "execution_count": 107,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_probablity('洋葱奶昔来一杯')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T12:14:10.632252Z",
     "start_time": "2020-11-29T12:14:10.611267Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "2.6427648931784153e-13"
      ]
     },
     "execution_count": 108,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_probablity('养乐多绿来一杯')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T12:14:26.556074Z",
     "start_time": "2020-11-29T12:14:26.537087Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sentence: 这个小猫看着这个小猫 with Prb: 6.98420628061632e-26\n",
      "sentence: 这个桌子坐在一个蓝色的蓝色的蓝色的好看的桌子 with Prb: 1.566360165405667e-41\n",
      "sentence: 一个桌子看着这个好看的桌子 with Prb: 5.552531899614806e-27\n",
      "sentence: 这个蓝色的小猫坐在一个好看的好看的好看的小小的桌子 with Prb: 6.817845390078086e-45\n",
      "sentence: 这个好看的篮球看见这个小猫 with Prb: 5.480375076732864e-23\n",
      "sentence: 这个好看的小小的蓝色的好看的篮球看见一个桌子 with Prb: 7.125869559280924e-38\n",
      "sentence: 这个女人看见这个好看的篮球 with Prb: 1.7827064431668885e-19\n",
      "sentence: 一个女人看见一个蓝色的好看的篮球 with Prb: 1.6148172347414713e-26\n",
      "sentence: 这个桌子看见这个桌子 with Prb: 2.11179491176741e-21\n",
      "sentence: 这个篮球看见这个蓝色的蓝色的蓝色的女人 with Prb: 2.2592706163574285e-34\n"
     ]
    }
   ],
   "source": [
    "for sen in [generate(gram=example_grammar, target='sentence') for i in range(10)]:\n",
    "    print('sentence: {} with Prb: {}'.format(sen, get_probablity(sen)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 110,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T12:14:46.368015Z",
     "start_time": "2020-11-29T12:14:46.348030Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "今天晚上请你吃大餐，我们一起吃日料 is more possible\n",
      "---- 今天晚上请你吃大餐，我们一起吃日料 with probility 2.0720365197865112e-42\n",
      "---- 明天晚上请你吃大餐，我们一起吃苹果 with probility 4.510828857674769e-43\n",
      "真是一只好看的小猫 is more possible\n",
      "---- 真事一只好看的小猫 with probility 2.5332326717830002e-21\n",
      "---- 真是一只好看的小猫 with probility 8.199229747670979e-19\n",
      "今晚我去吃火锅 is more possible\n",
      "---- 今晚我去吃火锅 with probility 8.562263341759753e-11\n",
      "---- 今晚火锅去吃我 with probility 3.881253468807565e-18\n",
      "养乐多绿来一杯 is more possible\n",
      "---- 洋葱奶昔来一杯 with probility 1.3585880599037004e-19\n",
      "---- 养乐多绿来一杯 with probility 2.6427648931784153e-13\n"
     ]
    }
   ],
   "source": [
    "need_compared = [\n",
    "    \"今天晚上请你吃大餐，我们一起吃日料 明天晚上请你吃大餐，我们一起吃苹果\",\n",
    "    \"真事一只好看的小猫 真是一只好看的小猫\",\n",
    "    \"今晚我去吃火锅 今晚火锅去吃我\",\n",
    "    \"洋葱奶昔来一杯 养乐多绿来一杯\"\n",
    "]\n",
    "\n",
    "for s in need_compared:\n",
    "    s1, s2 = s.split()\n",
    "    p1, p2 = get_probablity(s1), get_probablity(s2)\n",
    "    \n",
    "    better = s1 if p1 > p2 else s2\n",
    "    \n",
    "    print('{} is more possible'.format(better))\n",
    "    print('-'*4 + ' {} with probility {}'.format(s1, p1))\n",
    "    print('-'*4 + ' {} with probility {}'.format(s2, p2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
 **评阅点** 1.">
    "> **Review point**: 1. was a new dataset used? 2. was the csv (txt) data parsed correctly?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 3. 获得最优质的的语言"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "当我们能够生成随机的语言并且能判断之后，我们就可以生成更加合理的语言了。请定义 generate_best 函数，该函数输入一个语法 + 语言模型，能够生成**n**个句子，并能选择一个最合理的句子: \n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "提示，要实现这个函数，你需要Python的sorted函数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[1, 2, 3, 5]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sorted([1, 3, 5, 2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这个函数接受一个参数key，这个参数接受一个函数作为输入，例如"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(1, 4), (2, 5), (4, 4), (5, 0)]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sorted([(2, 5), (1, 4), (5, 0), (4, 4)], key=lambda x: x[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "能够让list按照第0个元素进行排序."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(5, 0), (1, 4), (4, 4), (2, 5)]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sorted([(2, 5), (1, 4), (5, 0), (4, 4)], key=lambda x: x[1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "能够让list按照第1个元素进行排序."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(2, 5), (1, 4), (4, 4), (5, 0)]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sorted([(2, 5), (1, 4), (5, 0), (4, 4)], key=lambda x: x[1], reverse=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "能够让list按照第1个元素进行排序, 但是是递减的顺序。"
   ]
  },
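  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When only the single best item is needed, `max` with the same `key` argument avoids sorting the whole list. A small sketch with made-up (sentence, probability) pairs; the names `candidates`, `ranked`, and `best` are illustrative:\n",
    "\n",
    "```python\n",
    "# Hypothetical (sentence, probability) pairs\n",
    "candidates = [('句子甲', 1e-20), ('句子乙', 3e-12), ('句子丙', 5e-15)]\n",
    "\n",
    "# sorted returns a new list; the original list is left untouched\n",
    "ranked = sorted(candidates, key=lambda x: x[1], reverse=True)\n",
    "best = max(candidates, key=lambda x: x[1])\n",
    "\n",
    "print(ranked[0] == best)  # both approaches pick the highest-probability pair\n",
    "```\n",
    "\n",
    "Because `sorted` does not modify its argument, its return value has to be assigned to a name before it can be used."
   ]
  },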
  {
   "cell_type": "code",
   "execution_count": 111,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T12:19:52.120120Z",
     "start_time": "2020-11-29T12:19:52.095134Z"
    }
   },
   "outputs": [],
   "source": [
    "def generate_best(grammar_string, num): # you code here\n",
    "    sentences = []\n",
    "    for i in range(num):\n",
    "        # 生成句子\n",
    "        sentence = generate(gram=create_grammar(grammar_string, split='=>'), target='sentence')\n",
    "        # 计算概率\n",
    "        probability = get_probablity(sentence)\n",
    "        sentences.append((sentence, probability))\n",
    "    # 按概率降序排序\n",
    "    sorted(sentences, key=lambda x: x[1], reverse=True)\n",
    "    return sentences[0]\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 118,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-29T12:22:17.504474Z",
     "start_time": "2020-11-29T12:22:17.463501Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('三里黄鹂上青门！万只西雪泊西吴?岭鸣青鹂万里柳！柳上翠船二秋雪?', 6.2882059989725445e-114)"
      ]
     },
     "execution_count": 118,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "generate_best(poem, 20)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "好了，现在我们实现了自己的第一个AI模型，这个模型能够生成比较接近于人类的语言。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
 **评阅点**">
    "> **Grading point**: whether a lambda expression is used for sorting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Q: 这个模型有什么问题？ 你准备如何提升？ "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ans:\n",
    "（1）生成句子的语法规则太简单，规则中的词太少；\n",
    "（2）概率整体都很小，由于生成的句子是四言古诗，概率语言模型使用的语料库应该是唐诗三百首之类。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
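**评阅点**">
    ">**Grading point**: whether practical problems are raised, such as the OOV (out-of-vocabulary) problem, the amount of training data, or moving to a 3-gram model."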
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### 以下内容为可选部分，对于绝大多数同学，能完成以上的项目已经很优秀了，下边的内容如果你还有精力可以试试，但不是必须的。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 4. (Optional) 完成基于Pattern Match的语句问答\n",
    "> 另外一份作业文件里有个optional，有兴趣的同学可以挑战一下"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "各位同学，我们已经完成了自己的第一个AI模型，大家对人工智能可能已经有了一些感觉，人工智能的核心就是，我们如何设计一个模型、程序，在外部的输入变化的时候，我们的程序不变，依然能够解决问题。人工智能是一个很大的领域，目前大家所熟知的深度学习只是其中一小部分，之后也肯定会有更多的方法提出来，但是大家知道人工智能的目标，就知道了之后进步的方向。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后，希望大家对AI不要有恐惧感，这个并不难，大家加油！"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1561828422005&di=48d19c16afb6acc9180183a6116088ac&imgtype=0&src=http%3A%2F%2Fb-ssl.duitang.com%2Fuploads%2Fitem%2F201807%2F28%2F20180728150843_BECNF.thumb.224_0.jpeg)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
